eration of constraint generation yields the constraint indicated by the line
labeled ‘generated constraint’, and (in this case) leads to a stable solution
$A^*$. The final step of our algorithm improves on this solution by interpolating
$A^*$ with the previous solution (in this case, $A$) to obtain $A^*_{final}$.

(B): The actual stable and unstable regions for the space of 2 x 2 matrices
$E_{\alpha,\beta} = [\,0.3 \;\; \alpha \,;\, \beta \;\; 0.3\,]$, with $\alpha, \beta \in [-10, 10]$. Constraint generation is
able to learn a nearly optimal model from a noisy state sequence of length
7 simulated from $E_{0,10}$, with better state reconstruction error than either
LB-1 or LB-2.
Abstract


A variety of learning problems in robotics, computer vision and other areas
of artificial intelligence can be construed as problems of learning statistical
models for dynamical systems from sequential observations. Good dynamical
system models allow us to represent and predict observations in these systems,
which in turn enables applications such as classification, planning, control,
simulation, anomaly detection and forecasting. One class of dynamical system
models assumes the existence of an underlying hidden random variable
that evolves over time and emits the observations we see. Past observations
are summarized into the belief distribution over this random variable, which
represents the state of the system. This assumption leads to ‘latent variable
models’ which are used heavily in practice. However, learning algorithms for
these models still face a variety of issues such as model selection, local optima
and instability. The representational ability of these models also differs
significantly based on whether the underlying latent variable is assumed to
be discrete as in Hidden Markov Models (HMMs), or real-valued as in Linear
Dynamical Systems (LDSs). Another recently introduced class of models
represents state as a set of predictions about future observations rather than
as a latent variable summarizing the past. These ‘predictive models’, such as
Predictive State Representations (PSRs), are provably more powerful than latent
variable models and hold the promise of allowing more accurate, efficient
learning algorithms since no hidden quantities are involved. However, this
promise has not been realized.

In this thesis we propose novel learning algorithms that address the issues
of model selection, local minima and instability in learning latent variable
models. We show that certain ‘predictive’ latent variable model learning methods
bridge the gap between latent variable and predictive models. We also
propose a novel latent variable model, the Reduced-Rank HMM (RR-HMM),
that combines desirable properties of discrete and real-valued latent-variable
models. We show that reparameterizing the class of RR-HMMs yields a subset
of PSRs, and propose an asymptotically unbiased predictive learning algorithm
for RR-HMMs and PSRs along with finite-sample error bounds for the
RR-HMM case. In terms of efficiency and accuracy, our methods outperform
alternatives on dynamic texture videos, mobile robot visual sensing data, and
other domains.
Chapter 1


Introduction 


Modeling of dynamical systems is an important aspect of robotics, artificial intelligence
and statistical machine learning. Such modeling is typically based on observations that
arise from the dynamical system over time. A distinguishing characteristic of dynamical
systems is that their observations exhibit temporal correlations, modeling of which is the
main challenge of dynamical systems analysis. Accurate models of dynamical systems
allow us to perform a variety of useful tasks, such as prediction, simulation, recognition,
classification, anomaly detection and control. In this thesis we focus on uncontrolled
dynamical systems with multivariate real-valued observations. This thesis contributes (1)
novel learning algorithms for existing dynamical system models that overcome significant
limitations of previous methods, (2) a deeper understanding of some important distinctions
between different dynamical system models based on differences in their underlying
assumptions and in their learning algorithms, and (3) a novel model that combines desirable
properties of several existing models, along with inference and learning algorithms which
have theoretical performance guarantees.

Two major approaches for modeling dynamical systems in machine learning are Latent
Variable Models (LVMs) and predictive models, which have different benefits and drawbacks.
An LVM for dynamical systems assumes its observations are noisy emissions from
an underlying latent variable that evolves over time and represents the state of the system.
In other words, the latent state is a sufficient statistic for all past observations. LVMs
probabilistically model both the latent variable’s evolution and the relationship between latents
and observables. Typical parameter learning algorithms for LVMs (such as the Expectation
Maximization (EM) algorithm and related methods) are prone to local optima of their
objective functions, and so cannot provide consistent parameter estimates with reasonable
amounts of computation especially for large models, since multiple restarts are required to
search the space of local optima. In contrast, predictive models (such as Predictive State
Representations (PSRs)) and their learning algorithms define the state of a dynamical system
as a set of predictions of expected values of statistics of the future, called tests. Since
there are no latent or “hidden” quantities involved, learning algorithms for predictive models
(which typically rely on matrix factorization rather than on EM) can yield consistent
parameter estimates, though guaranteeing well-formed parameters with finite samples is
often a challenge. Research from control theory as well as recent work in statistical learning
theory (including parts of this thesis) have blurred the distinction between LVMs and
predictive models by showing that LVMs can be learned in a globally optimal fashion with
predictive algorithms, allowing us to interpret their latent variables as tests.

The two best-known examples of LVMs for continuous-observation dynamical systems
are Hidden Markov Models (HMMs) (Chapter 2) and Linear Dynamical Systems (LDSs)
(Chapter 3). Other LVMs for dynamical systems are often based on one or both of these
two models. We describe important properties of HMMs and LDSs in more detail below.

HMMs assume a discrete latent variable that can take on finitely many values, each
characterized by a unique probability distribution over observations. These assumptions
allow HMMs to model a large variety of predictive distributions during forward simulation
(including predictive distributions which are not log-concave), which is an advantage for
modeling a variety of real-world dynamical systems. We will use the term competitive
inhibition to denote the ability of a model (such as the HMM) to represent non-log-concave
predictive distributions. A model that performs competitive inhibition can place probability
mass on distinct observations while disallowing mixtures of those observations. However,
the discrete nature of the HMM’s latent state makes it difficult to model smoothly evolving
dynamical systems, which are also common in practice. Another difficulty with HMMs
is the problem of model selection, or determining the correct number of states and the
structure (e.g. sparsity) of the transition and observation functions.

On the other hand, LDSs assume a continuous latent variable that evolves linearly
with Gaussian noise, and a Gaussian observation distribution whose mean is linear in
the latent variable. These assumptions make LDSs adept at modeling smoothly evolving
dynamical systems but unable to perform competitive inhibition. The inability to handle
competitive inhibition stems from the fact that the predictive distribution is always
log-concave; therefore any convex combination of likely observations will also be likely. Also
unlike HMMs, matrix-factorization-based approaches to learning LDSs make it easy to
perform model selection. Another distinction from HMMs is that conventional learning
algorithms for the LDS do not guarantee stable parameters for modeling its dynamics.
This can be either a benefit or a drawback, since the system to be modeled may be either
unstable or stable. However, all of the systems that we consider modeling in this thesis are
stable, so we consider it a drawback when a learning algorithm returns an unstable set of
parameters.

In this thesis we advance the theory and practice of learning dynamical system models
from data in several ways. We first address the tendency of HMM learning algorithms to
get stuck in local optima, and the need to pre-define the number of states: we develop
Simultaneous Temporal and Contextual Splitting (STACS), a novel EM-based algorithm for
performing both model selection and parameter learning efficiently in Gaussian HMMs
while avoiding local minima (Chapter 4). Results show improved learning performance
on a wide variety of real-world domains. We next address a deficiency in conventional
LDS learning algorithms: we propose a matrix-factorization-based learning algorithm for
LDSs that uses constrained optimization to guarantee stable parameters and yields superior
results in simulation and prediction of a variety of real-world dynamical systems
(Chapter 5). Finally, we address the more ambitious goal of bridging the gap between
models that can perform competitive inhibition and models that can represent smoothly
evolving systems: we propose the Reduced-Rank Hidden Markov Model (RR-HMM), a
model that can do both the above (Chapter 6). We investigate its relationship to existing
models, and propose a predictive learning algorithm along with theoretical performance
guarantees. We demonstrate results on a variety of high-dimensional real-world data sets,
including vision sensory output from a mobile robot.
Chapter 2


Hidden  Markov  Models 


Hidden Markov Models (HMMs) are LVMs where the underlying hidden variable can
take on one of finitely many discrete values. Introduced in the late 1960s, HMMs have
been used most extensively in speech recognition [4, 5] and bioinformatics [6] but also in
diverse application areas such as computer vision and information extraction [7, 8]. For
an excellent tutorial on HMMs, see Rabiner [4]. In this chapter we define HMMs and
describe their standard inference and learning algorithms.


2.1  Definition 

Let $h_t \in \{1, \ldots, m\}$ denote the discrete hidden states of an HMM at time $t$, and $x_t \in \{1, \ldots, n\}$
denote the observations. These can be either discrete or continuous; we will specify
our assumptions explicitly for different instances. Let $T \in \mathbb{R}^{m \times m}$ be the state transition
probability matrix with its $[i,j]$th entry $T_{ij}$ having value $\Pr[h_{t+1} = i \mid h_t = j]$. $O$ is the
column-stochastic observation model such that $O(i,j) = \Pr[x_t = i \mid h_t = j]$. For discrete
observations, $O$ is a matrix of observation probabilities of size $n \times m$, and $O(i,j) = O_{ij}$
denotes the $[i,j]$th entry of $O$. For continuous observations, $O(x,j)$ denotes the probability
of observation $x$ under a Gaussian distribution specific to state $j$, i.e. $O(x,j) = \mathcal{N}(x \mid
\mu_j, \Sigma_j)$. $\pi \in \mathbb{R}^m$ is the initial state distribution with $\pi_i = \Pr[h_1 = i]$. We use $\lambda$ to denote
the entire set of HMM parameters $\{T, O, \pi\}$. Figure 2.1 illustrates the graphical model
corresponding to an HMM.

Let $\vec{h}_t \in \mathbb{R}^m$ denote the system’s belief state, i.e. a distribution over hidden states at
time $t$. If we use $e_i$ to denote the $i$th column of the identity matrix, then $\vec{h}_t$ is equivalent to
the conditional expectation of $e_{h_t}$, with the conditioning variables clear from context. We
use the term path to denote a sequence of hidden states $H = h_1, h_2, \ldots, h_\tau$ corresponding
to a sequence of observations $X = x_1, x_2, \ldots, x_\tau$ from an HMM.

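Computing the probability of a sequence of observations with an HMM is very simple.
Note that $\Pr[x_1]$ can be expressed as a chain product of HMM parameters. Let $O_{x,\cdot}$ denote
the row of $O$ containing probabilities for observation $x$ under each possible state. Now,
define the parameters $A_x$ as

$$A_x = T \,\mathrm{diag}(O_{x,\cdot}) \tag{2.1}$$

Then, with $1_m$ denoting the length-$m$ vector of ones,

$$\Pr[x_1] = \sum_g \Pr[x_1 \mid h_1 = g] \Pr[h_1 = g] = 1_m^\top \mathrm{diag}(O_{x_1,\cdot})\, \pi$$

Similarly, $\Pr[x_1 x_2 \ldots x_\tau]$ can be expressed as

$$\begin{aligned}
\Pr[x_1 x_2 \ldots x_\tau]
&= \sum_{g_1, \ldots, g_\tau} \Pr[x_\tau \mid h_\tau = g_\tau] \Pr[h_\tau = g_\tau \mid h_{\tau-1} = g_{\tau-1}] \cdots \Pr[x_1 \mid h_1 = g_1] \Pr[h_1 = g_1] \\
&= 1_m^\top T \,\mathrm{diag}(O_{x_\tau,\cdot})\, T \,\mathrm{diag}(O_{x_{\tau-1},\cdot}) \cdots T \,\mathrm{diag}(O_{x_1,\cdot})\, \pi \\
&= 1_m^\top A_{x_\tau} A_{x_{\tau-1}} A_{x_{\tau-2}} \cdots A_{x_1} \pi
\end{aligned} \tag{2.2}$$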
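The chain product in Equation (2.2) can be illustrated with a minimal NumPy sketch. The function name `sequence_probability`, the zero-based observation indices and the two-state toy parameters are assumptions made for illustration only; they do not come from the thesis.

```python
import numpy as np

def sequence_probability(T, O, pi, xs):
    """Pr[x_1, ..., x_tau] as the chain product 1^T A_{x_tau} ... A_{x_1} pi.

    T  : (m, m) column-stochastic, T[i, j] = Pr[h_{t+1} = i | h_t = j]
    O  : (n, m) column-stochastic, O[x, j] = Pr[x_t = x | h_t = j]
    pi : (m,) initial state distribution
    xs : observation indices in 0..n-1 (zero-based, an assumption of this sketch)
    """
    v = np.asarray(pi, dtype=float)
    for x in xs:
        A_x = T @ np.diag(O[x, :])   # Equation (2.1)
        v = A_x @ v
    return float(np.ones(len(v)) @ v)

# Hypothetical two-state, two-symbol model used only as an example.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])

print(sequence_probability(T, O, pi, [0, 1, 1]))
```

Because $T$ is column-stochastic, the leading $T$ in the product does not change the total, so the probabilities of all sequences of a given length sum to one.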

We now describe standard algorithms for filtering (forward inference), smoothing (backward
inference), path inference, learning and model selection in HMMs.


2.2  Filtering  and  Smoothing 

Figure 2.1: The graphical model representation of an HMM.

Filtering is the process of maintaining a belief distribution while incrementally conditioning
on observations over time in the forward direction, from past to present. Define the
forward variable $\alpha(t,i)$ as

$$\alpha(t,i) = \Pr[x_1, \ldots, x_t, h_t = i \mid \lambda]$$

Then, the filtering belief distribution can be written using Bayes rule as the vector $\hat{\alpha}(t,\cdot) =
[\hat{\alpha}(t,1) \ldots \hat{\alpha}(t,m)]^\top$ where the $i$th element is:

$$\hat{\alpha}(t,i) = \Pr[h_t = i \mid x_1, \ldots, x_t, \lambda] = \frac{\alpha(t,i)}{\sum_j \alpha(t,j)}$$


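The values of $\alpha(t,i)$ for all $t, i$ can be computed inductively according to the following
equation [4], where the observation is the first argument of $O$ as in Section 2.1:

$$\alpha(t,i) = O(x_t, i) \sum_j T_{ij}\, \alpha(t-1, j)$$

The corresponding vector update equation is:

$$\alpha(t,\cdot) = \mathrm{diag}(O_{x_t,\cdot})\, T\, \alpha(t-1,\cdot)$$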

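Given the filtering belief probability $\hat{\alpha}(t,i) = \Pr[h_t = i \mid x_1, \ldots, x_t]$ of being in state $i$ at
time $t$, the corresponding probability at time $t+1$ can be obtained directly in the following
two steps:

$$\begin{aligned}
\Pr[h_{t+1} = i \mid x_1, \ldots, x_t]
&= \sum_j \Pr[h_{t+1} = i \mid h_t = j] \Pr[h_t = j \mid x_1, \ldots, x_t] \qquad \text{(prediction step)} \\
&= \sum_j T_{ij}\, \hat{\alpha}(t,j)
\end{aligned}$$

$$\begin{aligned}
\hat{\alpha}(t+1, i) &= \Pr[h_{t+1} = i \mid x_1, \ldots, x_t, x_{t+1}] \\
&= \frac{\Pr[x_{t+1} \mid h_{t+1} = i] \Pr[h_{t+1} = i \mid x_1, \ldots, x_t]}{\sum_j \Pr[x_{t+1} \mid h_{t+1} = j] \Pr[h_{t+1} = j \mid x_1, \ldots, x_t]} \qquad \text{(update step)} \\
&= \frac{O(x_{t+1}, i) \Pr[h_{t+1} = i \mid x_1, \ldots, x_t]}{\sum_j O(x_{t+1}, j) \Pr[h_{t+1} = j \mid x_1, \ldots, x_t]}
\end{aligned}$$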

The forward variables also allow easy computation of the observation sequence probability:

$$\Pr[X \mid \lambda] = \sum_j \Pr[x_1, \ldots, x_\tau, h_\tau = j \mid \lambda] = \sum_j \alpha(\tau, j)$$
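The prediction and update steps can be sketched in NumPy as a normalized forward recursion. This is an illustrative sketch for the discrete-observation case, not the thesis implementation; the name `forward_filter`, the zero-based indices and the toy parameters are assumptions. Accumulating the log of each normalizer gives the log of the sequence probability without underflow on long sequences.

```python
import numpy as np

def forward_filter(T, O, pi, xs):
    """Normalized forward recursion for a discrete-observation HMM.

    Returns (beliefs, log_prob), where beliefs[t] = Pr[h_t | x_1..x_t]
    and log_prob = log Pr[x_1..x_tau].
    T[i, j] = Pr[h_{t+1}=i | h_t=j], O[x, j] = Pr[x_t=x | h_t=j].
    """
    m = len(pi)
    beliefs = np.zeros((len(xs), m))
    log_prob = 0.0
    pred = np.asarray(pi, dtype=float)   # Pr[h_1] before any observation
    for t, x in enumerate(xs):
        unnorm = O[x, :] * pred          # numerator of the update step
        c = unnorm.sum()                 # Pr[x_t | x_1..x_{t-1}]
        beliefs[t] = unnorm / c
        log_prob += np.log(c)
        pred = T @ beliefs[t]            # prediction step for time t+1
    return beliefs, log_prob

# Hypothetical two-state, two-symbol model used only as an example.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])

beliefs, log_prob = forward_filter(T, O, pi, [0, 1, 1])
print(beliefs[-1], np.exp(log_prob))
```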

In contrast to filtering, smoothing in HMMs is the process of maintaining a belief distribution
over the present based on observations in the future. Define the backward variable
$\beta(t,i)$ as

$$\beta(t,i) = \Pr[x_{t+1}, \ldots, x_\tau \mid h_t = i, \lambda]$$

The value of $\beta(t,i)$ can also be updated inductively as follows, starting from $\beta(\tau,i) = 1$:

$$\beta(t,i) = \sum_j O(x_{t+1}, j)\, T_{ji}\, \beta(t+1, j)$$

The corresponding vector update equation is:

$$\beta(t,\cdot) = T^\top \mathrm{diag}(O_{x_{t+1},\cdot})\, \beta(t+1,\cdot)$$

The process of computing the forward and backward variables from an observation sequence
using filtering and smoothing as described above is known as the forward-backward
algorithm. Its running time is $O(\tau m^2)$. Together, the forward and backward variables
allow us to compute the posterior stepwise beliefs and posterior stepwise transition probabilities
for an entire observation sequence, which are denoted by $\gamma(t,i)$ and $\xi(t,i,j)$
respectively:

$$\gamma(t,i) = \Pr[h_t = i \mid x_1, \ldots, x_\tau] = \frac{\alpha(t,i)\, \beta(t,i)}{\Pr[X \mid \lambda]}$$

$$\xi(t,i,j) = \Pr[h_t = i, h_{t+1} = j \mid x_1, \ldots, x_\tau] = \frac{\alpha(t,i)\, T_{ji}\, \beta(t+1,j)\, O(x_{t+1}, j)}{\Pr[X \mid \lambda]}$$

As before, the denominators can be computed quickly by summing over the numerator.
These variables will be useful when describing parameter learning algorithms for HMMs.
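As an illustration of how the forward and backward variables combine, the following NumPy sketch computes the stepwise posteriors for a discrete-observation HMM. It is an unscaled sketch, so it is suitable only for short sequences; the function name `forward_backward`, the zero-based indices and the toy parameters are assumptions, and the backward variable is taken to be $\Pr[x_{t+1}, \ldots, x_\tau \mid h_t = i]$.

```python
import numpy as np

def forward_backward(T, O, pi, xs):
    """Unscaled forward-backward pass for a discrete-observation HMM.

    T[i, j] = Pr[h_{t+1}=i | h_t=j], O[x, j] = Pr[x_t=x | h_t=j].
    Returns alpha, beta, gamma and xi, where
    gamma[t, i] = Pr[h_t=i | X] and xi[t, i, j] = Pr[h_t=i, h_{t+1}=j | X].
    """
    tau, m = len(xs), len(pi)
    alpha = np.zeros((tau, m))
    beta = np.ones((tau, m))             # beta(tau, i) = 1
    alpha[0] = O[xs[0], :] * pi
    for t in range(1, tau):
        alpha[t] = O[xs[t], :] * (T @ alpha[t - 1])
    for t in range(tau - 2, -1, -1):
        beta[t] = T.T @ (O[xs[t + 1], :] * beta[t + 1])
    px = alpha[-1].sum()                 # Pr[X | lambda]
    gamma = alpha * beta / px
    xi = np.zeros((tau - 1, m, m))
    for t in range(tau - 1):
        # alpha(t,i) * T_{ji} * O(x_{t+1}, j) * beta(t+1, j) / Pr[X]
        weight_next = O[xs[t + 1], :] * beta[t + 1]
        xi[t] = alpha[t][:, None] * T.T * weight_next[None, :] / px
    return alpha, beta, gamma, xi

# Hypothetical two-state, two-symbol model used only as an example.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])

alpha, beta, gamma, xi = forward_backward(T, O, pi, [0, 1, 1, 0])
print(gamma)
```

Summing `xi[t]` over its second state index recovers `gamma[t]`, which is a convenient consistency check on an implementation.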


2.3  Path  Inference 

Path inference is the task of computing a path corresponding to a given observation sequence
for a given HMM, such that the joint likelihood of path and observation sequence
is maximized. Let $\lambda$ denote the HMM and $X$ denote the sequence of observations. Then,
path inference computes $H^*$ such that

$$H^* = \arg\max_H \Pr[X, H \mid \lambda]$$

For an observation sequence of length $\tau$, the Viterbi algorithm [9] computes an optimal
path in running time $O(\tau m^2)$ using dynamic programming.

Define $\delta(t,i)$ as

$$\delta(t,i) = \max_{h_1, \ldots, h_{t-1}} \Pr[h_1 h_2 \cdots h_t = i,\; x_1 x_2 \cdots x_t \mid \lambda]$$

Though computing $\delta(t,i)$ for all $t, i$ naively would have running time that is exponential
in $\tau$, it can be computed inductively in a more efficient fashion. The inductive formula for
$\delta(t,i)$ used in the Viterbi algorithm is

$$\delta(t+1, j) = \left[\max_i \delta(t,i)\, T_{ji}\right] O(x_{t+1}, j)$$

Since there is a maximization over $m$ terms carried out for each state per timestep, and
there are $m \times \tau$ $\delta(t,i)$ values to be calculated, the total running time of the Viterbi algorithm
is $O(\tau m^2)$.
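The Viterbi recursion can be sketched as follows. This is a minimal log-space version for discrete observations, assuming strictly positive parameters so that the logarithms are finite; the name `viterbi`, the zero-based indices and the toy parameters are illustrative assumptions rather than part of the thesis.

```python
import numpy as np

def viterbi(T, O, pi, xs):
    """Most likely hidden path under a discrete-observation HMM (log space).

    T[i, j] = Pr[h_{t+1}=i | h_t=j], O[x, j] = Pr[x_t=x | h_t=j].
    Returns (path, log_joint) with log_joint = log Pr[X, H* | lambda].
    """
    tau, m = len(xs), len(pi)
    log_T, log_O = np.log(T), np.log(O)
    delta = np.zeros((tau, m))
    back = np.zeros((tau, m), dtype=int)
    delta[0] = np.log(pi) + log_O[xs[0], :]
    for t in range(1, tau):
        # scores[j, i] = delta(t-1, i) + log T_{ji}
        scores = delta[t - 1][None, :] + log_T
        back[t] = scores.argmax(axis=1)          # best predecessor of each state j
        delta[t] = scores.max(axis=1) + log_O[xs[t], :]
    path = np.zeros(tau, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(tau - 1, 0, -1):              # follow back-pointers
        path[t - 1] = back[t, path[t]]
    return path, float(delta[-1].max())

# Hypothetical two-state, two-symbol model used only as an example.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])

path, log_joint = viterbi(T, O, pi, [0, 1, 1])
print(path, np.exp(log_joint))
```

Each timestep performs a maximization over $m$ predecessors for each of $m$ states, which matches the $O(\tau m^2)$ running time stated above.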
2.4 Learning Hidden Markov Models


A large body of research exists on algorithms for learning HMM parameters from data. We
focus on two of the most common techniques here, namely Baum-Welch and Viterbi Training.
These are both iterative methods analogous to the popular k-means algorithm [10, 11]
for clustering independent and identically distributed (IID) data, in the sense that they
monotonically minimize a distortion function of the data with respect to a fixed number of
“centers” (here, states) using successive iterations of computing distances to these centers
and updating these centers to better positions. Since HMMs model sequential data, there
is an additional dimension to these learning algorithms, namely the order in which datapoints
tend to appear in the training sequence, which is modeled by the HMM’s transition
matrix.

The more recently developed spectral learning algorithm for HMMs [12, 13] relies on
a Singular Value Decomposition (SVD) of a correlation matrix of past and future observations
to derive an observable representation of an HMM. We describe this algorithm in
detail in Chapter 6.


2.4.1  Expectation  Maximization 

Given one or several sequences of observations and a desired number of states $m$, we can fit
an HMM to the data using an instance of EM [14] called Baum-Welch [15] which was discovered
before the general EM algorithm. Baum-Welch alternates between steps of computing
a set of expected sufficient statistics from the observed data (the E-step) and updating
the parameters using estimates computed from these statistics (the M-step). The main
advantage of Baum-Welch is that these closed-form iterations are guaranteed to monotonically
converge to an optimum of the observed data log-likelihood $ll(X \mid \lambda) = \log \Pr[X \mid \lambda]$.
The disadvantage is that it is only guaranteed to reach a local optimum, and there are no
guarantees about reaching the global optimum. In practice, this issue is often addressed by
running EM several times starting from different random parameter initializations. However,
as the number of states increases, the algorithm is increasingly prone to local optima
to an extent that is difficult to overcome by multiple restarts. The algorithm can be summarized
as follows for a given training sequence $X = (x_1, x_2, \ldots, x_\tau)$, for both discrete
and continuous observations (assuming multinomial and Gaussian observation models
respectively):


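1. Initialize $\lambda = (\pi, T, O)$ randomly to valid values (i.e. preserving non-negativity and
stochasticity where needed).

2. Repeat while log-likelihood $ll(X \mid \lambda)$ increases by more than some threshold $\epsilon$:

(a) E-step: Use forward-backward algorithm on $\lambda$ and $X$ to compute $\alpha(t,i), \beta(t,i)$
for all $t, i$ and from these compute $\gamma(t,i)$ and $\xi(t,i,j)$ for all $t, i, j$.

(b) M-step: Compute updated parameter estimates $\hat{\lambda} = (\hat{\pi}, \hat{T}, \hat{O})$ as follows:

$$\hat{\pi}_i = \gamma(1,i) \qquad \forall i$$

$$\hat{T}_{ij} = \frac{\sum_{t=1}^{\tau-1} \xi(t,j,i)}{\sum_{t=1}^{\tau-1} \gamma(t,j)} \qquad \forall i,j$$

Multinomial observation model:

$$\hat{O}(x,j) = \frac{\sum_{t : x_t = x} \gamma(t,j)}{\sum_{t=1}^{\tau} \gamma(t,j)} \qquad \forall x,j$$

Gaussian observation model:

$$\hat{\mu}_j = \frac{\sum_{t=1}^{\tau} \gamma(t,j)\, x_t}{\sum_{t=1}^{\tau} \gamma(t,j)} \qquad \forall j$$

$$\hat{\Sigma}_j = \frac{\sum_{t=1}^{\tau} \gamma(t,j)\, (x_t - \hat{\mu}_j)(x_t - \hat{\mu}_j)^\top}{\sum_{t=1}^{\tau} \gamma(t,j)} \qquad \forall j$$

(c) $\lambda \leftarrow \hat{\lambda}$

3. Return final parameter estimates $\lambda$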
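The steps above can be sketched for the multinomial observation model as a single EM iteration in NumPy. This is an illustrative, unscaled sketch for short sequences only, not the thesis implementation; the name `baum_welch_step`, the zero-based indices and the starting parameters are assumptions.

```python
import numpy as np

def baum_welch_step(T, O, pi, xs):
    """One Baum-Welch iteration for a discrete-observation HMM (unscaled).

    T[i, j] = Pr[h_{t+1}=i | h_t=j], O[x, j] = Pr[x_t=x | h_t=j].
    Returns (T_new, O_new, pi_new, log_lik), where log_lik is the
    log-likelihood of xs under the input parameters.
    """
    xs = np.asarray(xs)
    tau, m, n = len(xs), len(pi), O.shape[0]

    # E-step: forward and backward variables, then gamma and summed xi.
    alpha = np.zeros((tau, m))
    beta = np.ones((tau, m))
    alpha[0] = O[xs[0]] * pi
    for t in range(1, tau):
        alpha[t] = O[xs[t]] * (T @ alpha[t - 1])
    for t in range(tau - 2, -1, -1):
        beta[t] = T.T @ (O[xs[t + 1]] * beta[t + 1])
    px = alpha[-1].sum()
    gamma = alpha * beta / px
    xi_sum = np.zeros((m, m))            # xi_sum[i, j] = sum_t xi(t, i, j)
    for t in range(tau - 1):
        weight_next = O[xs[t + 1]] * beta[t + 1]
        xi_sum += alpha[t][:, None] * T.T * weight_next[None, :] / px

    # M-step: closed-form re-estimates.
    pi_new = gamma[0]
    T_new = xi_sum.T / gamma[:-1].sum(axis=0)[None, :]   # T_ij uses xi(t, j, i)
    O_new = np.zeros((n, m))
    for x in range(n):
        O_new[x] = gamma[xs == x].sum(axis=0)
    O_new /= gamma.sum(axis=0)[None, :]
    return T_new, O_new, pi_new, float(np.log(px))

# Hypothetical starting point and training sequence used only as an example.
T0 = np.array([[0.6, 0.3],
               [0.4, 0.7]])
O0 = np.array([[0.6, 0.4],
               [0.4, 0.6]])
pi0 = np.array([0.5, 0.5])
train = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

params = (T0, O0, pi0)
for _ in range(5):
    *params, log_lik = baum_welch_step(*params, train)
    print(log_lik)
```

The printed log-likelihoods should be non-decreasing from one iteration to the next, in line with the monotone convergence property described above. A practical implementation would rescale the forward and backward variables to avoid underflow.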
2.4.2 Viterbi Training


Viterbi Training is the hard-updates analogue of Baum-Welch, in the sense that the E-step
approximates the posterior stepwise belief and transition probability distributions $\gamma$ and
$\xi$ with delta functions at a particular state and transition at every timestep. The particular
state and transition chosen at each timestep are the state and transition in the Viterbi
path at that time. The M-step therefore sets transition and observation probabilities based
on counts computed from the Viterbi path. To update the prior $\pi$, if training is being performed
using several observation sequences the prior is based on the distribution of $h_1^*$ over
these sequences. For a single training sequence, it is best to set the prior to be uniform
rather than setting it to be a delta function at $h_1^*$, though intermediate choices are also possible
(e.g. Laplace Smoothing, which allows biased priors while ensuring no probability is
set to zero). The steps of Viterbi Training can be summarized as follows:


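1. Initialize $\lambda = (\pi, T, O)$ randomly to valid values (i.e. preserving non-negativity and
stochasticity where needed).


2. Repeat while the Viterbi path keeps changing:


(a) E-step: Compute the Viterbi path $H^* = (h_1^*, h_2^*, \ldots, h_\tau^*)$

(b) M-step: Compute updated parameter estimates $\hat{\lambda} = (\hat{\pi}, \hat{T}, \hat{O})$ as follows.
The prior $\hat{\pi}$ is set as described above (from the distribution of $h_1^*$ over the training
sequences, or uniform for a single sequence). The remaining parameters are set from counts:

$$\hat{T}_{ij} = \frac{\sum_{t : h_t^* = j \,\wedge\, h_{t+1}^* = i} 1}{\sum_{t < \tau : h_t^* = j} 1} \qquad \forall i,j$$

Multinomial observation model:

$$\hat{O}(x,j) = \frac{\sum_{t : h_t^* = j \,\wedge\, x_t = x} 1}{\sum_{t : h_t^* = j} 1} \qquad \forall x,j$$

Gaussian observation model:

$$\hat{\mu}_j = \frac{\sum_{t : h_t^* = j} x_t}{\sum_{t : h_t^* = j} 1} \qquad \forall j$$

$$\hat{\Sigma}_j = \frac{\sum_{t : h_t^* = j} (x_t - \hat{\mu}_j)(x_t - \hat{\mu}_j)^\top}{\sum_{t : h_t^* = j} 1} \qquad \forall j$$

(c) $\lambda \leftarrow \hat{\lambda}$


3. Return final parameter estimates $\lambda$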


The asymptotic running time of both Baum-Welch and Viterbi Training is $O(\tau m^2)$
per iteration. However, Viterbi Training is faster by a constant factor. Viterbi Training
converges to a local maximum of the complete data likelihood $\Pr[X, H \mid \lambda]$, which does
not necessarily correspond to a local maximum of the observed data likelihood $\Pr[X \mid \lambda]$
as is usually desired. In practice, Viterbi Training is often used to initialize the slower
Baum-Welch algorithm which does converge to a local maximum of $\Pr[X \mid \lambda]$.
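To make the hard-update M-step of Viterbi Training concrete, the following sketch re-estimates transition and multinomial observation probabilities from counts along a given path. The name `viterbi_m_step` and the optional `pseudo` count are assumptions of this sketch; with `pseudo=0.0` it reproduces plain count ratios, and a state that never appears in the path then produces an undefined (NaN) column, which a small positive pseudo-count avoids.

```python
import numpy as np

def viterbi_m_step(path, xs, m, n, pseudo=0.0):
    """Count-based M-step of Viterbi Training (multinomial observations).

    path : hidden states h*_1..h*_tau as indices in 0..m-1
    xs   : observations x_1..x_tau as indices in 0..n-1
    Returns (T_hat, O_hat) with T_hat[i, j] ~ Pr[h_{t+1}=i | h_t=j]
    and O_hat[x, j] ~ Pr[x_t=x | h_t=j].
    """
    T_counts = np.full((m, m), float(pseudo))
    O_counts = np.full((n, m), float(pseudo))
    for t in range(len(path) - 1):
        T_counts[path[t + 1], path[t]] += 1.0   # transition j -> i stored at [i, j]
    for t in range(len(path)):
        O_counts[xs[t], path[t]] += 1.0
    T_hat = T_counts / T_counts.sum(axis=0, keepdims=True)
    O_hat = O_counts / O_counts.sum(axis=0, keepdims=True)
    return T_hat, O_hat

# Hypothetical Viterbi path and observations used only as an example.
T_hat, O_hat = viterbi_m_step(path=[0, 0, 1, 1, 0], xs=[0, 0, 1, 1, 1], m=2, n=2)
print(T_hat)
print(O_hat)
```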


2.5  Related  Work 

Recently, HMMs and their algorithms have been re-examined in light of their connections
to Bayesian Networks, such as in [16]. Many variations on the basic HMM model
have also been proposed, such as coupled HMMs [7] for modeling multiple interacting
processes, Input-Output HMMs [17] which incorporate inputs into the model, hierarchical
HMMs [18] for modeling hierarchically structured state spaces, and factorial HMMs [19]
that model the state space in a distributed fashion. Another notable example of a specialized
sub-class of HMMs tailored for a particular task is the constrained HMM [20] which
was developed originally in the context of speech recognition. Nonparametric methods
such as Hierarchical Dirichlet Processes (HDPs) [21] have been used to define sampling-based
versions of HMMs with “infinitely” many states [21, 22] which integrate out the
hidden state parameter. This class of models has since been improved upon in several
ways (e.g. [23]) to bring it closer to a practical model, though it remains challenging to
tractably perform learning or inference in these models on large multivariate data.
Chapter 3


Linear  Dynamical  Systems 


In the case where the state of an LVM is multivariate real-valued and the noise terms are
Gaussian, the resulting model is called a linear dynamical system (LDS), also known as a
Kalman Filter [24] or a state-space model [25]. LDSs are an important tool for modeling
time series in engineering, controls and economics as well as the physical and social sciences.
In this section we define LDSs and describe their inference and learning algorithms
as well as review the property of stability as it relates to the LDS transition model, which
will be relevant later in Chapter 6. More details on LDSs and algorithms for inference and
learning in LDSs can be found in several standard references [26, 27, 28, 29].

3.1  Definition 

Linear  dynamical  systems  can  be  described  by  the  following  two  equations: 


$$x_{t+1} = A x_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q) \tag{3.1a}$$

$$y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R) \tag{3.1b}$$

Time is indexed by the discrete index $t$. Here $x_t$ denotes the hidden states in $\mathbb{R}^n$, $y_t$ the
observations in $\mathbb{R}^m$, and $w_t$ and $v_t$ are Gaussian noise variables. In this thesis, we will
assume $w_t$ and $v_t$ are zero-mean, though this may not hold in general. Assume the initial
state $x(0) = x_0$. The parameters of the system are the dynamics matrix $A \in \mathbb{R}^{n \times n}$, the
observation model $C \in \mathbb{R}^{m \times n}$, and the noise covariance matrices $Q$ and $R$ denoted by the
following equation:

$$\mathbb{E}\left[ \begin{bmatrix} w_t \\ v_t \end{bmatrix} \begin{bmatrix} w_s^\top & v_s^\top \end{bmatrix} \right] = \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix} \delta_{ts} \tag{3.2}$$


In this thesis we are concerned with uncontrolled linear dynamical systems, though, as
in previous work, control inputs can easily be incorporated into the model. Also note
that in continuous-time dynamical systems, which we also exclude from consideration,
the derivatives are specified as functions of the current state. They can be approximately
converted to discrete-time systems via discretization.
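As a concrete illustration of Equations (3.1a) and (3.1b), the following NumPy sketch draws a state and observation sequence from an uncontrolled LDS. The name `simulate_lds` and the example matrices are assumptions made for illustration; the example dynamics matrix was chosen to have all eigenvalues inside the unit circle, so the simulated system is stable.

```python
import numpy as np

def simulate_lds(A, C, Q, R, x0, steps, rng):
    """Sample states and observations from x_{t+1} = A x_t + w_t, y_t = C x_t + v_t.

    A : (n, n) dynamics matrix, C : (m, n) observation model,
    Q : (n, n) and R : (m, m) noise covariances, x0 : (n,) initial state.
    Returns (xs, ys) with shapes (steps, n) and (steps, m).
    """
    n, m = A.shape[0], C.shape[0]
    xs = np.zeros((steps, n))
    ys = np.zeros((steps, m))
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        xs[t] = x
        ys[t] = C @ x + rng.multivariate_normal(np.zeros(m), R)   # (3.1b)
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)       # (3.1a)
    return xs, ys

# Hypothetical 2-state, 1-observation system used only as an example.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = 0.1 * np.eye(1)

xs, ys = simulate_lds(A, C, Q, R, x0=[1.0, 1.0], steps=50, rng=np.random.default_rng(0))
print(np.max(np.abs(np.linalg.eigvals(A))))   # spectral radius; below 1 means stable
```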


3.2  Inference 

In this section we describe the forwards and backwards inference algorithms for LDSs.
More details can be found in several sources [26, 27, 29].

The distribution over state at time $t$, $\Pr[X_t \mid y_{1:T}]$, can be exactly computed in two
parts: a forward recursion which is dependent on the initial state $x_0$ and the observations
$y_{1:t}$ known as the Kalman filter, and a backward recursion which uses the observations
from $y_T$ to $y_{t+1}$ known as the Rauch-Tung-Striebel (RTS) equations. The combined forward
and backward recursions are together called the Kalman smoother. Finally, it is worth
noting that the standard LDS filtering and smoothing inference algorithms [24, 30] are
instantiations of the junction tree algorithm for Bayesian Networks on a dynamic Bayesian
network analogous to the one in Figure 2.1 (see, for example, [31]).

3.2.1  The  Forward  Pass  (Kalman  Filter) 

Let the mean and covariance of the belief state estimate $\Pr[X_t \mid y_{1:t}]$ at time $t$ be denoted
by $\hat{x}_t$ and $P_t$ respectively. The estimates $\hat{x}_t$ and $P_t$ can be predicted from the previous time
step, the exogenous input, and the previous observation. Let $\hat{x}_{t_1|t_2}$ denote an estimate of
variable $x$ at time $t_1$ given data $y_1, \ldots, y_{t_2}$. We then have the following recursive equations:

$$\hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} \tag{3.3a}$$

$$P_{t|t-1} = A P_{t-1|t-1} A^\top + Q \tag{3.3b}$$

Equation (3.3)(a) can be thought of as applying the dynamics matrix $A$ to the mean to
form an initial prediction of $\hat{x}_t$. Similarly, Equation (3.3)(b) can be interpreted as using
the dynamics matrix $A$ and error covariance $Q$ to form an initial estimate of the belief
covariance $P_t$. The estimates are then adjusted:

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t e_t \tag{3.3c}$$

$$P_{t|t} = P_{t|t-1} - K_t C P_{t|t-1} \tag{3.3d}$$

where the error in prediction at the previous time step (the innovation) $e_t$ and the Kalman
gain matrix $K_t$ are computed as follows:

$$e_t = y_t - C \hat{x}_{t|t-1} \tag{3.3e}$$

$$K_t = P_{t|t-1} C^\top (C P_{t|t-1} C^\top + R)^{-1} \tag{3.3f}$$

The weighted error in Equation (3.3)(c) corrects the predicted mean given an observation,
and Equation (3.3)(d) reduces the variance of the belief by an amount proportional to the
observation covariance. Taken together, Equations 3.3(a-f) define a specific form of the
Kalman filter known as the forward innovation model.
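The forward pass can be sketched directly from Equations 3.3(a-f). In this illustrative sketch the arguments `x0` and `P0` are assumed to play the role of $\hat{x}_{0|0}$ and $P_{0|0}$, and the function name `kalman_filter` is hypothetical. The explicit matrix inverse mirrors Equation (3.3f) for readability; a linear solve is usually preferred numerically.

```python
import numpy as np

def kalman_filter(A, C, Q, R, x0, P0, ys):
    """Kalman filter forward pass following Equations 3.3(a-f).

    Returns filtered means (steps, n) and covariances (steps, n, n).
    x0 and P0 are treated as the estimate at time 0 (an assumption).
    """
    x = np.asarray(x0, dtype=float)
    P = np.asarray(P0, dtype=float)
    means, covs = [], []
    for y in ys:
        x_pred = A @ x                       # (3.3a)
        P_pred = A @ P @ A.T + Q             # (3.3b)
        e = y - C @ x_pred                   # (3.3e) innovation
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)   # (3.3f)
        x = x_pred + K @ e                   # (3.3c)
        P = P_pred - K @ C @ P_pred          # (3.3d)
        means.append(x)
        covs.append(P)
    return np.array(means), np.array(covs)

# Scalar example: a static state observed twice with unit noise variance.
A = np.array([[1.0]])
C = np.array([[1.0]])
Q = np.array([[0.0]])
R = np.array([[1.0]])
means, covs = kalman_filter(A, C, Q, R, x0=[0.0], P0=[[1.0]],
                            ys=np.array([[2.0], [2.0]]))
print(means.ravel(), covs.ravel())
```

In this scalar example the filtered mean moves toward the observed value while the variance shrinks with each observation, as expected from Equation (3.3)(d).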


3.2.2  The  Backward  Pass  (RTS  Equations) 

The forward pass finds the mean and variance of the states $x_t$, conditioned on past
observations. The backward pass corrects the results of the forward pass by evaluating the
influence of future observations on these estimates. Once the forward recursion has
completed and the final values of the mean and variance $x_{T|T}$ and $P_{T|T}$ have been calculated,
the backward pass proceeds in reverse by evaluating the influence of future observations
on the states in the past:


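$$x_{t|T} = x_{t|t} + G_t (x_{t+1|T} - x_{t+1|t}) \tag{3.4a}$$

$$P_{t|T} = P_{t|t} + G_t (P_{t+1|T} - P_{t+1|t}) G_t^{\top} \tag{3.4b}$$

where $x_{t+1|t}$ and $P_{t+1|t}$ are 1-step predictions

$$x_{t+1|t} = A x_{t|t} \tag{3.4c}$$

$$P_{t+1|t} = A P_{t|t} A^{\top} + Q \tag{3.4d}$$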


and the smoother gain matrix $G$ is computed as:

$$G_t = P_{t|t} A^{\top} (P_{t+1|t})^{-1} \tag{3.4e}$$


The cross variance $P_{t-1,t|T} = \mathrm{Cov}[X_{t-1}, X_t \mid y_{1:T}]$, a useful quantity for parameter
estimation (section 3.3.1), may also be computed at this point:

$$P_{t-1,t|T} = G_{t-1} P_{t|T} \tag{3.4f}$$
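The backward recursion can be sketched in the same style. The snippet below is an illustrative sketch, not the thesis implementation: the function name `rts_smoother` and the array layout are assumptions, and the gain is taken to be the standard Rauch-Tung-Striebel form $G_t = P_{t|t} A^{\top} (P_{t+1|t})^{-1}$. It consumes the filtered means and covariances produced by a forward pass.

```python
import numpy as np

def rts_smoother(xs_filt, Ps_filt, A, Q):
    """Backward pass following Equations (3.4a-f).

    xs_filt : (T, n) filtered means x_{t|t}
    Ps_filt : (T, n, n) filtered covariances P_{t|t}
    Returns smoothed means, smoothed covariances, and cross[t], the
    covariance between consecutive states t and t+1 given all data.
    """
    xs_filt = np.asarray(xs_filt, dtype=float)
    Ps_filt = np.asarray(Ps_filt, dtype=float)
    T = xs_filt.shape[0]
    xs_smooth = xs_filt.copy()   # the last entry is already x_{T|T}
    Ps_smooth = Ps_filt.copy()
    cross = np.zeros((T - 1,) + Ps_filt.shape[1:])
    for t in range(T - 2, -1, -1):
        x_pred = A @ xs_filt[t]                        # (3.4c)
        P_pred = A @ Ps_filt[t] @ A.T + Q              # (3.4d)
        G = Ps_filt[t] @ A.T @ np.linalg.inv(P_pred)   # assumed standard gain
        xs_smooth[t] = xs_filt[t] + G @ (xs_smooth[t + 1] - x_pred)          # (3.4a)
        Ps_smooth[t] = Ps_filt[t] + G @ (Ps_smooth[t + 1] - P_pred) @ G.T    # (3.4b)
        cross[t] = G @ Ps_smooth[t + 1]                # (3.4f), index shifted by one
    return xs_smooth, Ps_smooth, cross
```

With identity dynamics and no process noise the state is constant, so every smoothed estimate collapses onto the final filtered estimate, which is a convenient sanity check.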


3.3  Learning  Linear  Dynamical  Systems 

Learning a dynamical system from data (system identification) involves finding the
parameters $\theta = \{A, C, Q, R\}$ and the distribution over hidden variables $Q = P(X \mid Y, \theta)$
that maximize the likelihood of the observed data. The maximum likelihood solution for
these parameters can be found through iterative techniques such as expectation maximization
(EM). An alternative approach is to use subspace identification methods to compute
an asymptotically unbiased solution in closed form. In practice, a good approach is to use
subspace identification to find a good initial solution and then to refine the solution with
EM. The EM algorithm for system identification is presented in section 3.3.1 and subspace
identification is presented in section 3.3.2.

3.3.1  Expectation Maximization


The EM algorithm is a widely applicable iterative procedure for parameter re-estimation.
The objective of EM is to find parameters that maximize the likelihood of the observed
data $P(Y \mid \theta)$ in the presence of latent variables $X$. EM maximizes the log-likelihood:

$$\mathcal{L}(\theta) = \log P(Y \mid \theta) = \log \int_X P(X, Y \mid \theta)\, dX \tag{3.5}$$

Using any distribution over the hidden variables $Q$, a lower bound on the log-likelihood
$\mathcal{F}(Q, \theta) \le \mathcal{L}(\theta)$ can be obtained by using Jensen's inequality (at equation (3.6b)). EM is
derived by maximizing $\mathcal{F}(Q, \theta)$ with respect to $Q$, which results in $Q = P(X \mid Y, \theta)$, the
posterior over hidden variables given the data and current parameter settings:


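$$
\begin{aligned}
\mathcal{L}(\theta) = \log P(Y \mid \theta) &= \log \int_X P(X, Y \mid \theta)\, dX && \text{(3.6a)} \\
&= \log \int_X Q(X) \frac{P(X, Y \mid \theta)}{Q(X)}\, dX \;\ge\; \int_X Q(X) \log \frac{P(X, Y \mid \theta)}{Q(X)}\, dX && \text{(3.6b)} \\
&= \int_X Q(X) \log P(X, Y \mid \theta)\, dX - \int_X Q(X) \log Q(X)\, dX && \text{(3.6c)} \\
&= \mathcal{F}(Q, \theta) && \text{(3.6d)}
\end{aligned}
$$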
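The slack in this bound can be made explicit with a standard identity, added here as a supplementary step. Writing the joint as $P(X, Y \mid \theta) = P(X \mid Y, \theta)\, P(Y \mid \theta)$ inside the bound and using the fact that $Q$ integrates to one gives:

```latex
\begin{aligned}
\mathcal{F}(Q,\theta)
  &= \int_X Q(X) \log \frac{P(X \mid Y,\theta)\, P(Y \mid \theta)}{Q(X)}\, dX \\
  &= \log P(Y \mid \theta) - \int_X Q(X) \log \frac{Q(X)}{P(X \mid Y,\theta)}\, dX \\
  &= \mathcal{L}(\theta) - \mathrm{KL}\big(Q \,\|\, P(X \mid Y,\theta)\big)
\end{aligned}
```

Since the Kullback-Leibler divergence is non-negative and is zero when the two distributions agree, the bound is tight when $Q$ equals the posterior over hidden variables, which is why maximizing over $Q$ selects that posterior.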


The EM algorithm alternates between maximizing the lower bound on the likelihood $\mathcal{F}$
with respect to the parameters $\theta$ and the distribution $Q$, holding the other quantity fixed.
Starting from some initial parameters $\theta_0$ we alternately apply:

$$\text{Expectation-step (E-step):} \quad Q_{k+1} \leftarrow \arg\max_{Q} \mathcal{F}(Q, \theta_k) \tag{3.7a}$$

$$\text{Maximization-step (M-step):} \quad \theta_{k+1} \leftarrow \arg\max_{\theta} \mathcal{F}(Q_{k+1}, \theta) \tag{3.7b}$$

where $k$ indexes an iteration, until convergence.


The E-Step  The E-step objective is maximized when $Q$ is exactly the conditional distribution of
$X$, $Q_{k+1}(X) = P(X \mid Y, \theta_k)$, at which point the bound becomes an equality: $\mathcal{F}(Q_{k+1}, \theta_k) =
\mathcal{L}(\theta_k)$. The maximizing distribution $Q_{k+1}(X)$ can be found by solving the LDS inference
(Kalman smoothing) problem: estimating the hidden state trajectory given the inputs, the
outputs, and the parameter values. This algorithm is outlined in section 3.2.

The M-Step  As noted in Equation (3.7)(b), the M-step is to find the maximum of
$\mathcal{F}(Q_{k+1}, \theta) = \int_X Q_{k+1}(X) \log P(X, Y \mid \theta)\, dX - \int_X Q_{k+1}(X) \log Q_{k+1}(X)\, dX$ with respect to $\theta$. The
parameters of the system $\theta = \{A, C, Q, R\}$ are estimated by taking the corresponding
partial derivative of the expected log-likelihood, setting to zero and solving, resulting in

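$$C = \left( \sum_{t=1}^{T} y_t\, \mathbb{E}\{x_t^{\top} \mid y_{1:T}\} \right) \left( \sum_{t=1}^{T} \mathbb{E}\{x_t x_t^{\top} \mid y_{1:T}\} \right)^{-1} \tag{3.8a}$$

$$R = \frac{1}{T} \sum_{t=1}^{T} \left( y_t y_t^{\top} - C\, \mathbb{E}\{x_t \mid y_{1:T}\}\, y_t^{\top} \right) \tag{3.8b}$$

$$A = \left( \sum_{t=2}^{T} \mathbb{E}\{x_t x_{t-1}^{\top} \mid y_{1:T}\} \right) \left( \sum_{t=2}^{T} \mathbb{E}\{x_{t-1} x_{t-1}^{\top} \mid y_{1:T}\} \right)^{-1} \tag{3.8c}$$

$$Q = \frac{1}{T-1} \left( \sum_{t=2}^{T} \mathbb{E}\{x_t x_t^{\top} \mid y_{1:T}\} - A \sum_{t=2}^{T} \mathbb{E}\{x_{t-1} x_t^{\top} \mid y_{1:T}\} \right) \tag{3.8d}$$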

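Also note that the state covariance matrix $Q$ (see Equation (3.8d)) is positive semi-definite
by construction since it is the Schur complement [32] of

$$\sum_{t=2}^{T} \mathbb{E}\left\{ \begin{bmatrix} x_t x_t^{\top} & x_t x_{t-1}^{\top} \\ x_{t-1} x_t^{\top} & x_{t-1} x_{t-1}^{\top} \end{bmatrix} \,\middle|\, y_{1:T} \right\} \succeq 0 \tag{3.9}$$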
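The M-step updates translate directly into a few matrix operations once the smoothed moments are available. The following is a hedged sketch under assumed conventions: the function name `m_step` and the argument layout are hypothetical, arrays are indexed from zero, and `Exx_lag[k]` is assumed to hold the expected outer product of state `k+1` with state `k`.

```python
import numpy as np

def m_step(ys, Ex, Exx, Exx_lag):
    """Closed-form updates in the spirit of Equations (3.8a-d).

    ys      : (T, m) observations
    Ex      : (T, n) smoothed means E{x_t | y_{1:T}}
    Exx     : (T, n, n) smoothed second moments E{x_t x_t^T | y_{1:T}}
    Exx_lag : (T-1, n, n), Exx_lag[k] = E{x_{k+1} x_k^T | y_{1:T}}
    """
    T = ys.shape[0]
    Syx = ys.T @ Ex                          # sum_t y_t E{x_t}^T
    C = Syx @ np.linalg.inv(Exx.sum(axis=0))
    R = (ys.T @ ys - C @ Syx.T) / T          # sum_t (y_t y_t^T - C E{x_t} y_t^T) / T
    S_lag = Exx_lag.sum(axis=0)              # sum_t E{x_t x_{t-1}^T}
    A = S_lag @ np.linalg.inv(Exx[:-1].sum(axis=0))
    # E{x_{t-1} x_t^T} is the transpose of E{x_t x_{t-1}^T}
    Q = (Exx[1:].sum(axis=0) - A @ S_lag.T) / (T - 1)
    return A, C, Q, R
```

When the moments come from a noiseless trajectory (zero posterior covariance), these updates reduce to ordinary least squares and recover the generating matrices exactly, which makes for a simple correctness check.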


3.3.2  Subspace  Identification 


Learning algorithms based on Expectation-Maximization (EM) iteratively optimize the
observed data likelihood by (a) computing posterior estimates of first and second moments
of the latent variable, and (b) computing the most likely parameters given these estimates
of the latent variable. The Maximum-Likelihood Estimate (MLE) is statistically efficient,
and EM-based methods can compute the MLE from finite data samples. However, EM-based
methods are computationally inefficient because they only guarantee finding a local
optimum of the observed data likelihood from a given starting point, and hence require
multiple restarts from random parameter initializations to search for the MLE.

In contrast, Subspace Identification (Subspace ID) algorithms view the sequential data
model learning task as being a reduced-rank regression from past to future observations,
with the goal of minimizing the predictive reconstruction error in L2 norm. The reason
it is not an ordinary regression task is that the algorithm must compute a subspace of the
observation space which allows prediction of future observations. This subspace is precisely
the domain of the multivariate continuous latent state variable, and the parameters
that map from this subspace to future latent states and observations are precisely the
dynamics matrix and observation matrix of the LDS and products thereof. Note that, like
other LVMs, since the LDS is an unidentifiable model, we only aim to discover the correct
parameters up to a similarity transform. The benefits and drawbacks of Subspace ID are
in some ways complementary to those of EM. Unlike multi-restart EM, Subspace ID is
somewhat statistically inefficient since it does not achieve the MLE for finite data samples,
but it is much more computationally efficient since it is not prone to local minima,
and the Singular Value Decomposition (SVD) [32] attains the optimum parameter estimate
efficiently in the limit. Thus at the granularity where SVD is a subroutine, Subspace ID is
a non-iterative algorithm that admits a closed-form solution, though SVD internally is an
iterative algorithm.

Figure 3.1: A. Sunspot data, sampled monthly for 200 years. Each curve is a month, the
x-axis is over years. B. First two principal components of a 1-observation Hankel matrix.
C. First two principal components of a 12-observation Hankel matrix, which better reflect
temporal patterns in the data.

Subspace ID has been described clearly in the literature in several places such as Van
Overschee (1996) [27], Katayama (2005) [29] and more recently in Boots (2009) [3].
We summarize the uncontrolled version of the algorithm here and refer the reader to the
aforementioned references for details.

One useful degree of freedom that Subspace ID allows us is the ability to incorporate
knowledge about the d-step observability of the linear system by specifying the length
of observation sequences that are treated as features by the algorithm. This is accomplished
by stacking observations in a block Hankel matrix [26] during regression, forcing
the resulting parameter estimates to reconstruct entire sequences of observations based on
multiple past observations. Define $Y_{0|i-1}$ as the following matrix of observations, where $0$
and $i$ are timesteps within the training data sequence:

$$
Y_{0|i-1} =
\begin{bmatrix}
y_0 & y_1 & \cdots & y_{j-1} \\
y_1 & y_2 & \cdots & y_j \\
\vdots & \vdots & \ddots & \vdots \\
y_{i-1} & y_i & \cdots & y_{i+j-2}
\end{bmatrix}_{mi \times j}
\tag{3.10}
$$
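To make the stacking concrete, the sketch below builds a block Hankel matrix from a sequence of observations. The helper name `block_hankel` and its argument names are assumptions for illustration; observations are stored as rows of a `(T, m)` array and indexed from zero, as in Equation (3.10).

```python
import numpy as np

def block_hankel(ys, start, num_block_rows, num_cols):
    """Block Hankel matrix whose block row r, column c holds y_{start + r + c}.

    ys : (T, m) array of observations, one per row.
    Requires start + num_block_rows - 1 + num_cols <= T.
    Returns an array of shape (m * num_block_rows, num_cols).
    """
    m = ys.shape[1]
    H = np.empty((m * num_block_rows, num_cols))
    for r in range(num_block_rows):
        # Block row r is the observation sequence shifted by r timesteps.
        H[r * m:(r + 1) * m, :] = ys[start + r:start + r + num_cols].T
    return H
```

Under these conventions, $Y_{0|i-1}$ corresponds to `block_hankel(ys, 0, i, j)`, and a matrix starting at timestep $i$ is obtained by changing `start`.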


$Y_p$ denotes a certain matrix of "past" observations, and $Y_p^{+}$ denotes its one-timestep
extension. Also, $Y_f$, $Y_f^{-}$ denote matrices of "future" inputs and observations and their one-step
contractions:

$$Y_p = Y_{0|i-1}, \qquad Y_p^{+} = Y_{0|i}$$

$$Y_f = Y_{i|2i-1}, \qquad Y_f^{-} = Y_{i+1|2i-1}$$

Matrices of the above form, with each block of rows equal to the previous block but shifted
by a constant number of columns, are called block Hankel matrices [26]. We also define
a matrix of Kalman filter latent state estimates at time $i$, conditioned on past observations
in $Y_p$, as $X_i$:

$$X_i = \begin{bmatrix} x_i & x_{i+1} & \cdots & x_{i+j-1} \end{bmatrix} \in \mathbb{R}^{n \times j} \tag{3.11}$$


Assuming the observations truly arise from an LDS, then the following relationship
between expected future observations, latent states and LDS parameters must hold:

$$
\mathbb{E}\{Y_f \mid X_i\} =
\begin{bmatrix}
C x_i & C x_{i+1} & \cdots & C x_{i+j-1} \\
C x_{i+1} & C x_{i+2} & \cdots & C x_{i+j} \\
\vdots & \vdots & \ddots & \vdots \\
C x_{2i-1} & C x_{2i} & \cdots & C x_{2i+j-2}
\end{bmatrix}_{mi \times j}
\tag{3.12}
$$


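$\Gamma_i$ is defined as the extended observability matrix:

$$
\Gamma_i =
\begin{bmatrix}
C \\ CA \\ CA^2 \\ \vdots \\ CA^{i-1}
\end{bmatrix}
\tag{3.13}
$$

$\Gamma_i$ is related to its one-step contraction $\Gamma_{i-1}$ by:

$$
\Gamma_i =
\begin{bmatrix}
\Gamma_{i-1} \\ CA^{i-1}
\end{bmatrix}
\tag{3.14}
$$

Using $\Gamma_i$ from equation (3.13) in equation (3.12), $\mathbb{E}\{Y_f \mid X_i\}$ can be written as:

$$\mathbb{E}\{Y_f \mid X_i\} = \Gamma_i X_i \tag{3.15}$$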
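The extended observability matrix is simple to construct numerically. The snippet below is a small sketch with an assumed helper name, `extended_observability`; it stacks the blocks by repeatedly multiplying the previous block by the dynamics matrix, which also makes the relationship between the matrix and its one-step contraction easy to verify.

```python
import numpy as np

def extended_observability(A, C, i):
    """Stack C, CA, ..., CA^{i-1} into a single (m*i, n) matrix."""
    blocks = [C]
    for _ in range(i - 1):
        # Each new block is the previous block multiplied by A on the right.
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)
```

Dropping the last block row of the result for `i` yields the result for `i - 1`, matching the contraction relationship described above.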


Note that $\Gamma_i X_i$ is a rank $n$ linear function of state that lies in $\mathrm{span}\{\Gamma_i\}$. The linear
projection of $Y_f$ onto $Y_p$ may be used to find $\Gamma_i X$ from $Y_f$ and $Y_p$, where $X$ denotes the
Kalman filter states conditioned on observations in $Y_p$. These Kalman filter state estimates
$X_i$ can be computed exactly in closed form as a linear function of $Y_p$ [27]. The projection
is obtained by solving a set of linear equations

$$Y_f / Y_p = Y_f Y_p^{\top} (Y_p Y_p^{\top})^{\dagger} Y_p \tag{3.16}$$

where $\dagger$ denotes the Moore-Penrose pseudo-inverse [32]. Define $\mathcal{O}_i$, $\mathcal{O}_{i+1}$ as projections
of future observations onto past observations in the following way:

$$\mathcal{O}_i = Y_f / Y_p = \Gamma_i X_i \tag{3.17a}$$

$$\mathcal{O}_{i+1} = Y_f^{-} / Y_p^{+} = \Gamma_{i-1} X_{i+1} \tag{3.17b}$$
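The projection of one matrix onto the row space of another can be sketched with NumPy's pseudo-inverse. The helper name `project_rows` is an assumption for illustration; it uses the identity that the pseudo-inverse of a matrix times the matrix itself is the orthogonal projector onto that matrix's row space, which is equivalent to the projection formula given in the text.

```python
import numpy as np

def project_rows(Yf, Yp):
    """Project the rows of Yf onto the row space of Yp.

    pinv(Yp) @ Yp is the orthogonal projector onto the row space of Yp,
    so the result has rank at most rank(Yp).
    """
    return Yf @ np.linalg.pinv(Yp) @ Yp
```

Two properties are worth checking when experimenting with this: projecting twice changes nothing, and the part of `Yf` that is discarded is orthogonal to every row of `Yp`.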

Subspace ID exploits relationships between several subspaces of interest to compute
estimates of $X_i$ and $X_{i+1}$. The rank of $\mathcal{O}_i$ is the dimensionality of the state space, the row
space of $\mathcal{O}_i$ is equal to the row space of $X_i$ and the column space of $\mathcal{O}_i$ is equal to the
column space of $\Gamma_i$ [27]. Compute the SVD of $\mathcal{O}_i$:

$$\mathcal{O}_i = U \Sigma V^{\top} \tag{3.18}$$

By properties of SVD, we know the columns of $U$ are an optimal basis for compressing
and reconstructing sequences of $i$ future observations.

As mentioned earlier, having multiple observations per column in $Y_f$ is particularly
helpful for learning models of systems that are not 1-step observable, e.g. when the
underlying dynamical system is known to have periodicity and a single observation is not
enough to tell us where we are in the period. For example, Figure 3.1(A) shows 200 years
of sunspot numbers, with each month modeled as a separate variable. Sunspots are known
to have two periods, the longer of which is 11 years. When subspace ID is performed
using a 12-observation Hankel matrix $Y_f$ and $Y_p$, the first two columns of $U$, i.e. the first
two principal components of $\mathcal{O}_i$, resemble the sine and cosine bases (Figure 3.1(C)), and
the corresponding state variables therefore are the coefficients needed to combine these
bases so as to reconstruct 12 years of the original sinusoid-like data, which captures their
periodicity. This is in contrast to the bases obtained by SVD on a 1-observation $Y_f$ and $Y_p$
(Figure 3.1(B)), which reconstruct just the variation within a single year.

Though $Y_p$ and $Y_f$ have finitely many observations from past and future, this window
size does not need to grow indefinitely for the algorithm to converge to the correct parameter
estimates. For any given dynamical system, there is a minimum window length that will
allow us to recover the true dynamics. The minimum window length is the shortest one
in which we are guaranteed to have positive probability of observing something relevant
to each dimension of latent state. Note that we do not need to guarantee a high probability
of making such observations, or to guarantee that our observations are particularly
informative about state, only that there is a positive probability of getting a positive amount
of information. It is possible that the minimum window length is infinite, but then the
dimensionality of the dynamical system would need to be infinite as well, hence it would
be a system that is difficult to recover with any method.

Subspace ID also allows a simple yet principled method of model selection: the order
of the system can be determined by inspecting the singular values on the diagonal of $\Sigma$, and
all components whose singular values lie below a user-defined threshold can be dropped.
One common way of choosing the dimensionality is to look for a "knee" in the graph of
decreasing-order singular values, indicating the noisy data arises from a low-rank system.
Then we estimate $\Gamma_i$, $\Gamma_{i-1}$ and $X_i$ as:


which, based on equation (3.14), allows us to estimate $\Gamma_{i-1}$:

$$\hat{X}_i = \hat{\Gamma}_i^{\dagger} \mathcal{O}_i \tag{3.20a}$$

$$\hat{X}_{i+1} = \hat{\Gamma}_{i-1}^{\dagger} \mathcal{O}_{i+1} \tag{3.20b}$$
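The order-selection and state-estimation steps can be illustrated with a short sketch. Several choices here are assumptions rather than details taken from the text: the helper name `estimate_states`, the use of a threshold relative to the largest singular value, and the convention of splitting the singular values evenly so that the observability estimate is the leading left singular vectors scaled by the square roots of the singular values (a common convention in subspace identification).

```python
import numpy as np

def estimate_states(O_i, rel_threshold=1e-8):
    """Pick a model order from the singular values and estimate states.

    Returns (n, Gamma_hat, X_hat), where n counts the singular values above
    rel_threshold times the largest one (assumes O_i is not all zeros).
    """
    U, s, Vt = np.linalg.svd(O_i, full_matrices=False)
    n = int(np.sum(s > rel_threshold * s[0]))
    # Assumed convention: Gamma_hat = U_n * sqrt(Sigma_n).
    Gamma_hat = U[:, :n] * np.sqrt(s[:n])
    # State estimate via the pseudo-inverse, as in Equation (3.20a).
    X_hat = np.linalg.pinv(Gamma_hat) @ O_i
    return n, Gamma_hat, X_hat
```

On noisy data the threshold would typically be chosen by inspecting the singular value spectrum for a knee rather than fixed in advance. Any invertible rescaling of `Gamma_hat` gives an equally valid model, since the parameters are only identified up to a similarity transform.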

Here we use $\hat{X}_i$ to denote estimates of the true Kalman filter states $X_i$ conditioned on
observations in $Y_p$. An important result that allows Subspace ID to claim consistency is
that, as the number of columns $j$ in our Hankel matrices goes to infinity, the state estimates
converge to the true Kalman filter state estimates conditioned on observations in $Y_p$ (up to
a linear transform) [27]:

$$\hat{X}_i \to X_i, \qquad \hat{X}_{i+1} \to X_{i+1} \qquad \text{as } j \to \infty$$

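After computing the estimates $\hat{X}_i$ and $\hat{X}_{i+1}$ we can estimate the LDS parameters $A$ and $C$
by solving the following system of equations, which is straightforward:

$$
\begin{bmatrix} \hat{X}_{i+1} \\ Y_{i|i} \end{bmatrix}
=
\begin{bmatrix} A \\ C \end{bmatrix} \hat{X}_i
+
\begin{bmatrix} \rho_w \\ \rho_v \end{bmatrix}
\tag{3.21a}
$$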


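Here, $\rho_w$ and $\rho_v$ are the residual errors of the noisy LDS, which are assumed to have
zero-mean Gaussian distributions. We can estimate the covariances $Q$ and $R$ from the estimated
residuals:

$$
\begin{bmatrix} Q & S \\ S^{\top} & R \end{bmatrix}
=
\mathbb{E}_j \left[
\begin{bmatrix} \rho_w \\ \rho_v \end{bmatrix}
\begin{bmatrix} \rho_w^{\top} & \rho_v^{\top} \end{bmatrix}
\right]
\tag{3.21b}
$$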
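The final regression step can be sketched as an ordinary least-squares problem. This is an illustrative sketch under stated assumptions: the helper name `estimate_parameters` is hypothetical, the two state matrices are assumed to have the same number of columns, `Y_ii` is assumed to hold the observations aligned with the columns of the first state matrix, and the residual covariance is estimated by a plain average over columns.

```python
import numpy as np

def estimate_parameters(X_i, X_ip1, Y_ii):
    """Least-squares estimate of A and C, then residual covariances.

    X_i   : (n, j) state estimates at the current step
    X_ip1 : (n, j) state estimates one step later
    Y_ii  : (m, j) observations aligned with the columns of X_i
    Returns A, C, Q, S, R.
    """
    n = X_i.shape[0]
    lhs = np.vstack([X_ip1, Y_ii])            # stacked left-hand side
    AC = lhs @ np.linalg.pinv(X_i)            # least-squares solution for [A; C]
    A, C = AC[:n], AC[n:]
    resid = lhs - AC @ X_i                    # stacked residuals [rho_w; rho_v]
    cov = resid @ resid.T / resid.shape[1]    # empirical average over columns
    Q, S, R = cov[:n, :n], cov[:n, n:], cov[n:, n:]
    return A, C, Q, S, R
```

On a noiseless trajectory the residuals vanish and the generating matrices are recovered exactly, which gives a quick way to validate the data alignment before applying the routine to real state estimates.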


Since the state estimates converge to their true values, the parameter estimates $\hat{\theta} = \{A, C, Q, R\}$
are asymptotically unbiased as the length of the training sequences goes to infinity [27].


3.4  Stability


We describe stability of dynamical systems based on material from [33]. Stability is a
property of dynamical systems defined in terms of equilibrium points. If all solutions
of a dynamical system that start out near an equilibrium state $x_e$ stay near or converge
to $x_e$, then the state $x_e$ is stable or asymptotically stable respectively. A linear system
$x_{t+1} = A x_t$ (with zero noise) is internally stable if the matrix $A$ is stable in the sense
of Lyapunov (see below). Internal stability is sufficient, though not necessary, for the
stability of a dynamical system. The standard algorithms for learning linear Gaussian
systems described in Section 3.3 do not enforce stability; when learning from finite data
samples, the maximum likelihood or subspace ID solution may be unstable even if the true
system is stable, due to the sampling constraints, modeling errors, and measurement noise.
A square matrix $A$ is said to be asymptotically stable in the sense of Lyapunov if and
only if, for any given positive semi-definite symmetric matrix $Q$,
there exists a positive-definite symmetric matrix $P$ that satisfies the following Lyapunov
criterion:

$$P - A P A^{\top} = Q \tag{3.22}$$

There is a direct connection between this criterion and the Kalman filter update in Equation
(3.3)(b). For a linear dynamical system, $A$ is the dynamics matrix, $P$ is the current belief
covariance, and $Q$ is the positive semi-definite state error covariance matrix (Equation
(3.8d)). Thus, the Lyapunov criterion can be interpreted as holding if there exists a belief
distribution where the predicted belief over state is equivalent to the previous belief over
state. It is interesting to note that the Lyapunov criterion holds if and only if the spectral
radius $\rho(A) \le 1$. Recall that a matrix $M$ is positive definite (semi-definite) iff $z^{\top} M z > 0$
($\ge 0$) for all non-zero vectors $z$. Let $\lambda$ be a left eigenvalue of $A$ and $\nu$ be a corresponding
eigenvector, giving us $\nu^{\top} A = \lambda \nu^{\top}$, then

$$\nu^{\top} Q \nu = \nu^{\top} (P - A P A^{\top}) \nu = \nu^{\top} P \nu - \nu^{\top} \lambda P \lambda \nu = \nu^{\top} P \nu\, (1 - |\lambda|^2) \ge 0 \tag{3.23}$$

since $\nu^{\top} P \nu > 0$, it follows that $|\lambda| \le 1$ and thus $\rho(A) \le 1$. When $\rho(A) < 1$, the system
is asymptotically stable. To see this, suppose $D \Lambda D^{-1}$ is the eigen-decomposition of $A$,
where $\Lambda$ has the eigenvalues of $A$ along the diagonal and $D$ contains the eigenvectors.
Then,

$$\lim_{k \to \infty} A^k = \lim_{k \to \infty} D \Lambda^k D^{-1} = 0 \tag{3.24}$$

since it is clear that $\lim_{k \to \infty} \Lambda^k = 0$. If $\rho(A) = 1$, then $A$ is stable but not asymptotically
stable, and the state $x_t$ oscillates around $x_e$ indefinitely. However, this is true only under
the assumption of zero noise. In the case of an LDS with Gaussian noise, a dynamics
matrix with unit spectral radius would cause the state estimate to move steadily away from
$x_e$. Hence such an LDS is asymptotically stable only when $\rho(A)$ is strictly less than one,
and the matrix $Q$ in equation (3.22) above is required to be positive definite. If $\rho(A) = 1$,
the LDS is said to be marginally stable.
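Both views of stability discussed here, the spectral radius and the Lyapunov criterion, are easy to check numerically for a given dynamics matrix. The sketch below uses assumed helper names (`spectral_radius`, `solve_discrete_lyapunov`) and solves the Lyapunov equation by vectorization with a Kronecker product, a simple choice that is only practical for small state dimensions.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def solve_discrete_lyapunov(A, Q):
    """Solve P - A P A^T = Q for P by vectorizing the equation.

    With row-major flattening, vec(A P A^T) = kron(A, A) vec(P), so
    (I - kron(A, A)) vec(P) = vec(Q). The system is singular when some
    product of two eigenvalues of A equals one.
    """
    n = A.shape[0]
    K = np.eye(n * n) - np.kron(A, A)
    return np.linalg.solve(K, Q.reshape(-1)).reshape(n, n)
```

For a matrix with spectral radius below one and a positive definite right-hand side, the solution is positive definite. For an unstable matrix the linear system may still be solvable, but the solution fails to be positive definite, so the criterion is not satisfied.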


3.5  Related  Work 


The EM algorithm for LDS was originally presented in [34]. Auto-Regressive (AR), Moving
Average (MA) and Auto-Regressive Moving Average (ARMA) models are simpler
time series modeling methods that are provably subsumed by LDSs [25]. Nonlinear
dynamical systems and their learning algorithms have also been studied, such as the extended
Kalman filter [35, 36] which linearizes the nonlinear system around the state estimate
at every step, allowing the approximate state distribution to remain Gaussian. An EM
algorithm for learning nonlinear dynamical systems has also been proposed [16], which
uses Extended Kalman Smoothing to compute the non-Gaussian conditional hidden state
distribution over the nonlinear dynamical system, and Radial Basis Functions (RBFs) to
represent the nonlinearities.


Chapter 4


Fast  State  Discovery  and  Learning  in 
Hidden  Markov  Models 


We first look at an algorithm for learning discrete-state LVMs with continuous observations,
specifically Gaussian HMMs. Typical algorithms for learning HMMs rely on knowing
the correct value for the number of states beforehand, and then optimizing the observed
data likelihood for a fixed number of states, until a local optimum is reached. In contrast,
STACS (Simultaneous Temporal and Contextual Splitting) reformulates the search space
by incrementally increasing the number of states (by splitting an existing state) during
parameter optimization in a way that maximally improves observed data likelihood. The
algorithm terminates when the improvement in explanation of the data by further HMM
growth no longer justifies the increase in model complexity, according to a standard model
selection scoring criterion. Both the splitting and scoring processes are carried out
efficiently by selective application of Viterbi approximations. This process makes parameter
learning more efficient and also helps avoid the local minima which greatly hinder
fixed-topology HMM learning algorithms, particularly for large state spaces.


4.1  Introduction


There has been extensive work on learning the parameters of a fixed-topology HMM. Several
algorithms for finding a good number of states and corresponding topology have been
investigated as well (e.g. [37, 38, 39]), but none of these are used in practice because
of one or more of: inaccuracy, high computational complexity, or being hard to implement.
Normally, the topology of an HMM is chosen a priori and a hill-climbing method
is used to determine parameter settings. For model selection, several HMMs are typically
trained with different numbers of states and the best of these is chosen. There are
two problems with this approach: firstly, training an HMM from scratch for each feasible
topology may be computationally expensive. Secondly, since parameter learning is
prone to local minima, we may inadvertently end up comparing a 'good' $m_1$-state HMM
to a 'bad' $m_2$-state HMM. The standard solution to this is to train several HMMs for each
topology with different parameter initializations in order to overcome local minima, which
however compounds the computational cost and is ineffectual for large state spaces. For
example, Figure 4.1(A) shows a simple data sequence where many training algorithms
can get caught in local minima. To illustrate the concepts we discuss, we shall use this
example throughout this chapter.

Because of these issues, many researchers have previously investigated top-down
state-splitting methods as an appealing choice for topology learning in HMMs with continuous
observation densities. This chapter describes Simultaneous Temporal and Contextual
Splitting (STACS), a recently proposed [40] top-down model selection and learning algorithm
that constructs an HMM by alternating between parameter learning and model selection
while incrementally increasing the number of states. Candidate models are generated
by splitting existing states and optimizing relevant parameters, and are then evaluated for
possible selection. Unlike previous methods, however, the splitting is carried out in a way
that accounts for both contextual (observation density) and temporal (transition model)
structure in the underlying data in a more general manner than the state-splitting methods
mentioned above, which fail on the simple example in Figure 4.1(A). In this research, we
closely examine these competing methods and illustrate the key differences between them
and STACS, followed by an extensive empirical comparison. Since hard-updates training
is a widely used alternative to soft-updates methods in HMMs because of its efficiency,
we also examine a hard-updates version of our algorithm, Viterbi STACS (V-STACS), and
explore its pros and cons for model selection and learning with respect to the soft-updates
method. We also evaluate STACS as an alternative learning algorithm for models of
predetermined size. To determine the stopping point for state-splitting, we use the Bayesian
Information Criterion [41], or BIC score. We discuss the benefits and drawbacks of this
in Section 4.3.3. We compare our approach to previous work on synthetic data as well
as several real-world data sets from the literature, revealing significant improvements in
efficiency and test-set likelihoods. We compare to previous algorithms on a sign-language
recognition task, with positive results. We also describe an application of STACS to learning
models for detecting diverse events in real-world unstructured audio data.

Throughout this chapter, assume the HMM notation introduced in Chapter 2.
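As a rough illustration of how a BIC-style score trades likelihood against model size, consider the sketch below. It assumes the common form of the score (log-likelihood minus half the number of free parameters times the log of the number of data points) and a parameter count for a Gaussian HMM with full covariance matrices. Both helper names and the exact parameter count are assumptions for illustration and may differ from the definition used later in the chapter.

```python
import math

def gaussian_hmm_num_params(m, d):
    """Free parameters of an m-state HMM with d-dimensional full-covariance
    Gaussian observations (an assumed parameterization)."""
    initial = m - 1                      # initial state distribution
    transitions = m * (m - 1)            # each row of the transition matrix
    means = m * d
    covariances = m * d * (d + 1) // 2   # symmetric covariance per state
    return initial + transitions + means + covariances

def bic_score(log_likelihood, num_params, num_observations):
    """Higher is better: fit is rewarded, parameters are penalized."""
    return log_likelihood - 0.5 * num_params * math.log(num_observations)
```

Under this form, adding a state is only worthwhile if the gain in log-likelihood exceeds the increase in the penalty term, which is the intuition behind using the score as a stopping rule for state-splitting.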


Figure 4.1: A. A time series from a 4-state HMM. Observations from the two states
down-middle-up and up-middle-down overlap and are indistinguishable without temporal
information. B. The HMM topology learned by ML-SSS and Li-Biswas on the data in A. C.
The correct HMM topology, successfully learned by the STACS algorithm.


4.2  Related Work


There has been extensive work on HMM model selection. However, most of this work is
either tailored to a specific application or is not scalable to learning topologies with more
than a few states. The most successful approaches are greedy algorithms that are either
bottom-up (i.e., starting with an overly large number of states) or top-down (i.e., starting
with a small or single-state model). We will briefly discuss the bottom-up approaches and
argue that they are unsuitable for practical large-scale HMM model selection as compared
to top-down approaches. After this, we will discuss the major top-down approaches from
the literature in more detail.

The primary drawbacks of bottom-up techniques are (a) having to choose and evaluate
merges from among candidate pairs, (b) having to know the maximum number of
states beforehand and (c) difficulty in generalizing to real-valued observations. Bottom-up
topology learning techniques start off with a superfluously large HMM and prune
parameters and/or states incrementally to shrink the model to an appropriate size. Stolcke and
Omohundro [42] demonstrate a Bayesian technique for learning HMMs by successively
merging pairs of states for discrete-observation HMMs, followed by Baum-Welch to optimize
parameter settings. Their model-merging technique starts off with one state for each
unique discrete-valued observation and uses a heuristic to pick the top few merge candidates
to evaluate. Another bottom-up approach is the Entropic Training technique of Brand
[39]. This technique uses an entropy-based prior that favors simpler models and relies on
an iteratively computed MAP estimator to successively trim model parameters. Though
the algorithm applies to real-valued observations and does not require the computation of
merges, it is complex, and the problem of having to start with a good upper bound on $N$
still holds true.

We therefore favor top-down methods for HMM model selection, especially when the
number of states may be large. We define some terminology first: split design refers to
the process of splitting an HMM state, optimizing parameters and creating an HMM for
possible selection. HMMs created by designing different splits are called candidates. The
two major alternative top-down approaches can be summarized as follows:




Figure 4.2: An illustration of the overly restrictive splits in ML-SSS. A. Original un-split
state $h^*$. B. A contextual split of $h^*$ in the ML-SSS algorithm. States $h_0$ and $h_1$ must have
the same transition structures and different observation models. C. A temporal split. State
$h_0$ has the incoming transition model of $h^*$ and $h_1$ has its outgoing ones.


Li-Biswas: This algorithm [37] examines two model selection candidates at every step:
one obtained by splitting the state with largest observation density variance, and the other
by merging the two states whose means are closest in Euclidean space. These candidates
are then optimized with EM on the entire HMM. The candidate with better likelihood is
chosen at each step, terminating when the candidates are worse than the original model.
The primary drawback with this heuristic is that it ignores dynamic structure while
deciding which states to split: a single low-variance state might actually be masking two
Markov states with overlapping densities, making it a better split candidate. Training two
candidates with full EM is also inefficient especially when they may not be the best
candidates, as our empirical evaluations will show.
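The candidate-selection heuristic described for Li-Biswas can be sketched in a few lines. This is an illustration of the heuristic as summarized here, not the original authors' code: the function name `li_biswas_candidates` is hypothetical, and measuring the "largest observation density variance" by the trace of each state's covariance matrix is an assumption.

```python
import numpy as np

def li_biswas_candidates(means, covs):
    """Pick the state to split and the pair of states to merge.

    means : (m, d) Gaussian means, one per state (m >= 2)
    covs  : (m, d, d) Gaussian covariances, one per state
    Returns (split_state, (merge_a, merge_b)).
    """
    # Split candidate: state with the largest total variance (assumed: trace).
    total_var = np.trace(covs, axis1=1, axis2=2)
    split_state = int(np.argmax(total_var))
    # Merge candidate: pair of distinct states with the closest means.
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    a, b = np.unravel_index(np.argmin(dists), dists.shape)
    return split_state, (int(min(a, b)), int(max(a, b)))
```

Note that nothing in this selection looks at the transition model, which is the weakness pointed out above: two states with overlapping densities but different temporal behavior are invisible to it.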

ML-SSS: Maximum-Likelihood Successive-State-Splitting [38] is designed to learn
HMMs that model variations in phones for continuous speech recognition systems. ML-SSS
incrementally builds an HMM by splitting one state at a time, considering all $m$
possible splits as candidates in each timestep. However, instead of full EM for each
candidate, the split on state $h^*$ into $h_0$ and $h_1$ (Figure 4.2) is trained by a constrained iterative
optimization of the expected likelihood gain from performing the split, while holding all
$\gamma_t(h)$ and $\xi_t(h, h')$ constant for $h, h' \ne h^*$. The iterations are performed over all timesteps
$t$ with non-zero¹ posterior occupancy probability for $h^*$. Each state is considered for two
kinds of splits, a contextual split (Figure 4.2(B)) that optimizes only observation densities,
and a temporal split (Figure 4.2(C)) that also optimizes self-transition and inter-split-state
transition probabilities.

Though more efficient than brute-force and Li-Biswas, the fact that ML-SSS does not
model transitions to and from other states while splitting makes it fail to detect underlying
Markov states with overlapping densities. For example, ML-SSS with BIC converges on
the HMM in Figure 4.1(B) while the true underlying HMM, successfully found by STACS,
is shown in Figure 4.1(C). Also, having to consider all data points with non-zero posterior
probability $\gamma_t(h^*)$ on every split is expensive in dense data with overlapping densities.


4.3  Simultaneous  Temporal  and  Contextual  Splits 

STACS is based on the insight that EM for sequential data is more robust to local
minima for small state spaces (e.g. two states) but less so for large state spaces. STACS
exploits this insight by using a constrained two-state EM algorithm to break out of local
minima and increase state space size. STACS algorithms perform parameter learning for
continuous-density HMMs with Gaussian observation models, while simultaneously
optimizing state space size and transition topology in a data-driven fashion. Unlike [38],
our method accounts for both temporal and contextual variations in the training
observations while splitting states (i.e. increasing the state space size). Unlike [37], our
method evaluates every existing state as a possible candidate for splitting. Since naively
evaluating all possible ways to increase state space size would be very expensive
computationally, STACS algorithms make selective use of Viterbi approximations [42, 43]
for efficient approximation of the data log-likelihood, as well as some other assumptions
detailed below. These approximations keep the complexity of each iteration of STACS and
V-STACS at $O(\tau m^2)$, with V-STACS being faster by a constant factor that is quite
noticeable in practice. Here $\tau$ is the length of the training sequence and m is the number
of states currently in the HMM.

¹ In practice, non-zero corresponds to being above a specified cutoff value.

Experiments show that STACS outperforms alternative topology learning methods [38,
37] in terms of running time, test-set likelihood and BIC score on a variety of real-world
data, and outperforms regular EM even on learning models of predetermined size. This
top-down approach proved to be highly effective at avoiding local minima as well,
allowing STACS to discover the true underlying HMM in difficult synthetic data where EM
failed to find the correct answer even with 50 restarts. This highlights the problem with
cross-validation for determining the number of states, which we alluded to in Section 4.1:
due to the local minima problems rife in EM-based HMM learning, there is no way to
ensure during cross-validation that the best possible HMMs of different sizes are being
compared. STACS avoids this problem by guaranteeing a consistent increase in data
likelihood as it searches the space of HMMs of varying state space sizes, based on
monotonicity results regarding variants of the EM algorithm [44].

We describe the overall algorithm as well as details of STACS and V-STACS below.
First, some notation: when considering a split of state h, HMM parameters related to state
h (denoted by $\lambda_h$), including incoming and outgoing transition probabilities, are replaced
by parameters for two offspring states $h_1$ and $h_2$ (denoted by $\lambda_{h_1,h_2}$). The time indices
assigned to state h, denoted by $T(h)$, are now assumed to have unknown hidden state
values, but only in the restricted state space $\{h_1, h_2\}$. Therefore when searching for a
locally optimal candidate, only the parameters $\lambda_{h_1,h_2}$ change, and the only observations
that will affect them are those at timesteps $T(h)$, i.e., $X_{T(h)}$. Let $\lambda_{\setminus h}$ denote parameters
not related to state h.
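The index set $T(h)$ and its contiguous runs, which the split procedures below operate on, can be illustrated with a short sketch. The helper names `timesteps_of` and `contiguous_runs` are hypothetical and chosen only for illustration; indices are 0-based here, whereas the text counts timesteps from 1.

```python
from typing import Hashable, List, Sequence, Tuple


def timesteps_of(path: Sequence[Hashable], h: Hashable) -> List[int]:
    """T(h): indices of the timesteps that a state path assigns to state h."""
    return [t for t, state in enumerate(path) if state == h]


def contiguous_runs(indices: Sequence[int]) -> List[Tuple[int, int]]:
    """Group sorted indices into maximal runs, returned as (first, last) pairs."""
    runs: List[Tuple[int, int]] = []
    for t in indices:
        if runs and t == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], t)   # extend the current run
        else:
            runs.append((t, t))           # start a new run
    return runs
```

For a Viterbi path such as `a a b b a b b`, `timesteps_of(path, "b")` gives the four timesteps whose labels are reopened when state `b` is split, and `contiguous_runs` groups them into the two stretches that are processed independently.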


4.3.1  The  Algorithm 

STACS and V-STACS both have the following overall procedure. For each step that is
not constant-time, we list the asymptotic complexities as [·] or as [·, ·] respectively if they
differ. Details on candidate generation and candidate selection are given in subsequent
sections.

1. Initialization: Initialize $\lambda$ to a single-state HMM (m = 1) using X. [$O(\tau)$]

2. Learning: Use Baum-Welch or Viterbi Training until convergence of $P(X \mid \lambda)$.

3. Candidate Generation: Split each state, to generate m candidate HMMs each with
m + 1 states. [$O(\tau m^2)$, $O(\tau m)$]

4. Candidate Selection: Score the original HMM and each candidate, and pick the
highest scoring one as $\lambda'$. [$O(\tau m^2)$, $O(\tau m)$]

5. Repeat or Terminate: If a split candidate was picked, $\lambda \leftarrow \lambda'$, $m \leftarrow m + 1$ and go to
step 2. Else if original HMM was picked, terminate and return $\lambda'$.
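The five steps above can be sketched as a generic loop. This is a minimal sketch rather than the implementation evaluated in this chapter: `split_search` and its callable arguments `train`, `split_candidates` and `score` are hypothetical names standing in for step 2, step 3 and the scoring of step 4, and a single score function is assumed for both ranking candidates and comparing against the original model.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")


def split_search(
    initial: Model,
    train: Callable[[Model], Model],
    split_candidates: Callable[[Model], List[Model]],
    score: Callable[[Model], float],
) -> Tuple[Model, int]:
    """Grow a model one split at a time until no split improves the score.

    Returns the final model and the number of accepted splits.
    """
    model = initial
    accepted = 0
    while True:
        model = train(model)                               # step 2: learning
        candidates = split_candidates(model)               # step 3: one candidate per state
        best = max(candidates, key=score, default=model)   # step 4: best candidate
        if score(best) <= score(model):                    # original model wins: terminate
            return model, accepted
        model = best                                       # step 5: accept the split
        accepted += 1
```

With a toy model that is just an integer state count and a score peaking at four states, the loop accepts three splits and then stops, which mirrors the termination rule in step 5.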

4.3.2  Generating  Candidates 

To generate a candidate based on state h resulting in new states $h_1$ and $h_2$, we devised
two novel variants of EM to efficiently compute optimal means, variances and transition
parameters of the resulting 2 states. We first perform Viterbi to find the optimal state
sequence $H^*$ by maximizing $P(X, H \mid \lambda)$. We then constrain the parameters for all other
states, $\lambda_{\setminus h}$. We assume all timesteps belonging to other states in $H^*$ are associated with
those states exclusively, which is equivalent to approximating the posterior belief at each
timestep by a delta function at the Viterbi-optimal state. Then, we perform Split-State
Viterbi Training (for V-STACS) or Split-State Baum-Welch (for STACS) to optimize
$\lambda_{h_1,h_2}$ on the timesteps associated with state h, i.e. $T(h)$. This effectively optimizes a
partially observed likelihood $P(X, H^*_{\setminus T(h)} \mid \lambda)$.

Candidate generation is highly efficient: Split-State Viterbi Training is $O(|T(h)|)$,
Split-State Baum-Welch is $O(m|T(h)|)$. Since $|T(h)|$ is equal to $\tau/m$ on average, and
there are m candidates to be generated, the total cost is $O(m\tau)$ and $O(\tau m^2)$ respectively.

Split-State Viterbi Training

Split-State Viterbi Training is the candidate generation algorithm for V-STACS. The
algorithm learns locally optimal values for the parameters $\lambda_{h_1,h_2}$ by alternating the
following two steps for each iteration i until convergence:

    E-step:  $H^{i}_{T(h)} \leftarrow \arg\max_{H} P(X_{T(h)}, H^*_{\setminus T(h)}, H \mid \lambda_{\setminus h}, \lambda^{i}_{h_1,h_2})$

    M-step:  $\lambda^{i+1}_{h_1,h_2} \leftarrow \arg\max_{\lambda} P(X_{T(h)}, H^*_{\setminus T(h)}, H^{i}_{T(h)} \mid \lambda_{\setminus h}, \lambda)$

Here, $H^*_{\setminus T(h)}$ denotes the base model Viterbi path excluding timesteps belonging to state
h. The first step above computes an optimal path through the split-state space. The second
step updates the relevant HMM parameters with their MLE estimates for a fully observed
path, which are simply ratios of counts for transition parameters, and averages of subsets
of $X_{T(h)}$ for the observation parameters. Convergence occurs when the state assignments
on $T(h)$ stop changing. The E-step of Split-State Viterbi Training is carried out using
a novel adaptation of Viterbi (called Split-State Viterbi) to the task of finding an optimal
path over a binary state space through a subset of the data while constraining the
rest to specific states. Given below is pseudocode for the Split-State Viterbi algorithm.
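Split-State Viterbi($X$, $\lambda_h$)

Initialization: Increase the number of states in the HMM by 1. Initialize new parameters
for states $h_1$ and $h_2$ that replace state h. For all other states $h'$ and for $i, j = 1, 2$:

    $T_{h' h_i} \leftarrow \frac{1}{2} T_{h' h}$
    $T_{h_i h'} \leftarrow T_{h h'}$
    $T_{h_i h_j} \leftarrow \frac{1}{2} T_{h h}$
    $O_{h_i} \leftarrow$ initialize to MLE Gaussian using $X_{T(h)}$ plus noise

Loop: for $k = 1 \ldots |T(h)|$

    $t \leftarrow T(h)[k]$
    if $t == 1$    // t is the 1st timestep
        then for $i \in \{1, 2\}$
            $\delta_t(i) \leftarrow \pi_{h_i} O_{h_i}(x_t)$
    else if $(t - 1) == T(h)[k - 1]$    // the previous timestep is also being optimized
        then for $i \in \{1, 2\}$
            $\delta_t(i) \leftarrow [\max_{j \in \{1,2\}} \delta_{t-1}(j) T_{h_j h_i}] O_{h_i}(x_t)$
            $\psi_t(i) \leftarrow \arg\max_{j \in \{1,2\}} \delta_{t-1}(j) T_{h_j h_i}$
    else for $i \in \{1, 2\}$    // the previous timestep belongs to a different state
            $\delta_t(i) \leftarrow T_{H^*_{t-1} h_i} O_{h_i}(x_t)$

Termination:

For all subsequences of $T(h)$ that are contiguous in $\{1 \ldots \tau\}$, backtrack through $\psi$ from
the end of the subsequence to its beginning to retrieve the corresponding portion of $H_{T(h)}$.

return $H_{T(h)}$.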

The running time is clearly linear in the number of non-determined timesteps $|T(h)|$ since
each maximization in the algorithm is always over 2 elements no matter how large the
actual HMM gets. Note that we allow the incoming and outgoing transition parameters
of $h_1$, $h_2$ to be updated as well, which allows better modeling of dynamic structure during
split design.
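The constrained two-state recursion described above can be sketched in log space as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function name `split_state_viterbi`, the dictionary-based parameter layout and the 0-based indexing are all hypothetical. It also scores the transition from the end of each run into the fixed state that follows, which is an assumption of this sketch motivated by the remark on outgoing transitions.

```python
from typing import Dict, Hashable, List, Sequence

State = Hashable


def split_state_viterbi(
    t_h: Sequence[int],
    base_path: Sequence[State],
    split: Sequence[State],
    log_init: Dict[State, float],
    log_trans: Dict[State, Dict[State, float]],
    log_emit: Dict[State, Sequence[float]],
) -> List[State]:
    """Relabel the timesteps in t_h with one of the two states in `split`.

    All other timesteps keep the label they have in base_path.
    log_emit[s][t] is the log-density of observation t under split state s.
    """
    path = list(base_path)
    members = set(t_h)
    n = len(path)
    t = 0
    while t < n:
        if t not in members:
            t += 1
            continue
        # Find the contiguous run [start, end] of timesteps being optimized.
        start = t
        while t + 1 < n and (t + 1) in members:
            t += 1
        end = t
        # Entry scores: initial distribution, or transition from a fixed state.
        delta = {}
        for s in split:
            entry = log_init[s] if start == 0 else log_trans[path[start - 1]][s]
            delta[s] = entry + log_emit[s][start]
        back = []  # back[k][s] = best predecessor of s at timestep start + k + 1
        for u in range(start + 1, end + 1):
            new_delta, pointers = {}, {}
            for s in split:
                prev = max(split, key=lambda r: delta[r] + log_trans[r][s])
                pointers[s] = prev
                new_delta[s] = delta[prev] + log_trans[prev][s] + log_emit[s][u]
            delta = new_delta
            back.append(pointers)
        # Exit scores: transition into the fixed state that follows the run.
        if end + 1 < n:
            for s in split:
                delta[s] += log_trans[s][path[end + 1]]
        # Backtrack from the end of the run to its beginning.
        state = max(split, key=lambda r: delta[r])
        path[end] = state
        for k in range(len(back) - 1, -1, -1):
            state = back[k][state]
            path[start + k] = state
        t = end + 1
    return path
```

Every maximization ranges over the two split states only, so the work per timestep is constant and the total cost is linear in the number of reopened timesteps, independent of the size of the full HMM.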


Split-State Baum-Welch

Split-State Baum-Welch also learns locally optimal values for the state-split parameters
$\lambda_{h_1,h_2}$. Like Baum-Welch, it does so by modeling the posterior over the hidden state
space, which in this case consists of $\{h_1, h_2\}$. The following two steps are repeated for
each iteration i until convergence:

1. E-step: Calculate stepwise occupancy and transition probabilities $\{\gamma_t, \xi_t\}$ from
the current split-state parameters $\lambda^{i}_{h_1,h_2}$ in order to compute expectations.

2. M-step:

    $\lambda^{i+1}_{h_1,h_2} \leftarrow \arg\max_{\lambda} P(X_{T(h)}, H^*_{\setminus T(h)} \mid \lambda_{\setminus h}, \lambda)$

The E-step is carried out using a specialized two-state partially-constrained-path
version of the forward-backward algorithm to first calculate the forward and backward
variables, which then give the required $\gamma$ and $\xi$ values for $h_1$ and $h_2$. The idea is the same
as Split-State Viterbi but with soft counts and updates. The entire algorithm is
$O(m|T(h)|)$, since the update step requires summations over all observations in $T(h)$ for
at least m transition parameters. Since it is not possible to compute the updated overall
likelihood from this split algorithm, convergence of Split-State Baum-Welch is heuristically
based on differences in the split-state $\alpha$ variables in successive timesteps.

Though  slower,  Split-State  Baum-Welch  searches  a  larger  space  of  candidate  models 
than  Split-State  Viterbi  Training,  performing  better  on  ‘difficult’  splits  (such  as  the  one 
required  in  Figure  4.1  with  high  hidden  variable  entropy)  just  as  Baum-Welch  performs 
better  than  Viterbi  Training  in  such  situations. 
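The soft-count E-step can be illustrated by a two-state forward-backward pass over one contiguous run of reopened timesteps. The following is a minimal, unscaled sketch suitable only for short runs; the function name `split_state_posteriors` and its argument layout (entry weights from the fixed preceding state, exit weights into the fixed following state) are assumptions made for illustration, not the specialized algorithm used in this chapter.

```python
from typing import List, Sequence


def split_state_posteriors(
    entry: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    exit_: Sequence[float],
) -> List[List[float]]:
    """Occupancy posteriors gamma[t][i] over the two split states for one run.

    entry[i]    : probability of entering split state i at the first timestep
    trans[i][j] : transition probability between split states i and j
    emit[t][i]  : likelihood of the observation at run position t under state i
    exit_[i]    : probability of leaving state i into the fixed state after the
                  run (use 1.0 when the run ends the sequence)
    """
    n = len(emit)
    alpha = [[0.0, 0.0] for _ in range(n)]
    beta = [[0.0, 0.0] for _ in range(n)]
    for i in range(2):
        alpha[0][i] = entry[i] * emit[0][i]
        beta[n - 1][i] = exit_[i]
    for t in range(1, n):                      # forward pass
        for i in range(2):
            total = sum(alpha[t - 1][j] * trans[j][i] for j in range(2))
            alpha[t][i] = total * emit[t][i]
    for t in range(n - 2, -1, -1):             # backward pass
        for i in range(2):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(2)
            )
    gamma = []
    for t in range(n):
        w = [alpha[t][i] * beta[t][i] for i in range(2)]
        z = w[0] + w[1]
        gamma.append([w[0] / z, w[1] / z])
    return gamma
```

Replacing the sums in the forward pass with maxima recovers the hard-assignment recursion of Split-State Viterbi, which is the sense in which the two procedures share the same idea.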

4.3.3  Efficient  Candidate  Scoring  and  Selection 

We compare candidates amongst each other using the fast-to-compute Viterbi likelihood
after optimizing the split parameters. Afterwards, as we described earlier, we compare
the best candidate to the original model using BIC score [41] (a.k.a. Schwarz criterion),
an efficiently computable approximation of the true posterior probability. The latter is
intractable since it involves integrating over exponentially many possible models. The
BIC score is an asymptotically accurate estimate (in the limit of large data) of the
posterior probability of a model when assuming a uniform prior. A Laplace approximation
is applied to the intractable integral in the likelihood term, and terms that do not depend
on data set size are dropped to obtain the approximated posterior log-probability. BIC
effectively punishes complexity by penalizing the number of free parameters, thus
safeguarding against overfitting, while rewarding goodness-of-fit via the data log-likelihood.
Let $\#\lambda$ denote the number of free parameters in HMM $\lambda$. Recall that $\tau$ denotes the length
of the data sequence, and n denotes the number of discrete observations (or dimensionality
of the real-valued observation vector). Then,

    $\mathrm{BIC}(\lambda, X) = \log P(X \mid \lambda) - \frac{\#\lambda}{2} \log \tau$

For V-STACS, the likelihood is approximated by the Viterbi path likelihood using the
Viterbi approximation [42, 43]. The unsplit model Viterbi path can be updated efficiently
to compute the Viterbi paths and Viterbi path likelihoods for each candidate in $O(\tau/m)$
amortized time, and hence $O(\tau)$ total. This avoids an extra $O(\tau m^2)$ step in V-STACS, and
keeps the complexity of V-STACS' candidate generation, scoring and selection at $O(m\tau)$.

BIC  as  defined  holds  for  probabilistic  models  with  observable  random  variables.  More 
recent  work  has  shown  that  the  effective  dimension  of  latent  variable  Bayesian  Networks 
is  equal  to  the  rank  of  the  Jacobian  of  the  transformation  between  the  parameters  of  the 
latent  variables  and  the  parameters  of  the  observable  variables  [45].  Using  the  number 
of  free  parameters  is  a  reasonable  approximation  since  this  corresponds  to  the  maximum 
possible  rank,  and  regular  BIC  with  this  measure  has  been  successfully  used  for  model 
selection  in  HMMs  in  previous  work  [37].  Nonetheless,  due  to  the  approximate  nature  of 
BIC  for  latent-variable  models  [45]  or  for  high-dimensional  data,  test-set  log-likelihood 
would  be  a  more  accurate  though  more  computationally  expensive  scoring  criterion  for 
splits. 
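As a concrete illustration of the scoring criterion, the sketch below computes a BIC score in the standard Schwarz form, i.e. the data log-likelihood minus half the number of free parameters times the log of the sequence length. The parameter count assumes an m-state HMM with n-dimensional, full-covariance Gaussian observations; that choice, and the function names, are assumptions of this sketch rather than details taken from the experiments.

```python
import math


def gaussian_hmm_free_parameters(m: int, n: int) -> int:
    """Free parameters of an m-state HMM with n-dimensional full-covariance Gaussians."""
    initial = m - 1                      # initial distribution sums to one
    transitions = m * (m - 1)            # each transition-matrix row sums to one
    means = m * n
    covariances = m * n * (n + 1) // 2   # one symmetric n x n matrix per state
    return initial + transitions + means + covariances


def bic(log_likelihood: float, num_free_params: int, tau: int) -> float:
    """BIC score: log-likelihood minus (k / 2) * log(tau). Higher is better."""
    return log_likelihood - 0.5 * num_free_params * math.log(tau)
```

Under this criterion a split is worth keeping only when the gain in log-likelihood exceeds the penalty for the extra parameters, so the same gain justifies a split more easily on longer sequences only insofar as the likelihood itself grows with the data.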


4.4  Experiments 

In our experiments we seek to compare STACS and V-STACS to Li-Biswas, ML-SSS, and
multi-restart Baum-Welch, in these areas:

1. Learning models of predetermined size

2. Model selection capability

3. Classification accuracy for multi-class sequential data

For (1) and (2), we are concerned both with the quality of models learned as indicated by
test-set likelihoods (for learning predetermined-size models) and BIC scores (for learning
state space dimension), as well as running-time efficiency. For (3), we examine sequential
data where each sequence is associated with a distinct class label. For such data, HMMs
have been successfully used in previous applications (e.g. speech recognition [4]) to
construct a classifier by training class-specific HMMs on sequences from different classes,
and using their likelihood scores on test-set sequences for classification. We will follow
the same procedure for the multiclass sequential data we examine.
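The classification procedure just described, one HMM per class with the label taken from the highest-scoring model, can be sketched as follows. For brevity the sketch uses discrete-output HMMs scored by a log-space forward algorithm, whereas the experiments in this chapter use Gaussian observation models; the `HMM` tuple layout and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (initial probabilities, transition matrix, emission matrix) of a discrete-output HMM
HMM = Tuple[List[float], List[List[float]], List[List[float]]]


def _log(p: float) -> float:
    return math.log(p) if p > 0.0 else -math.inf


def _logsumexp(values: Sequence[float]) -> float:
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_likelihood(hmm: HMM, obs: Sequence[int]) -> float:
    """log P(obs | hmm) via the forward algorithm in log space."""
    init, trans, emit = hmm
    m = len(init)
    alpha = [_log(init[i]) + _log(emit[i][obs[0]]) for i in range(m)]
    for x in obs[1:]:
        alpha = [
            _logsumexp([alpha[j] + _log(trans[j][i]) for j in range(m)])
            + _log(emit[i][x])
            for i in range(m)
        ]
    return _logsumexp(alpha)


def classify(models: Dict[str, HMM], obs: Sequence[int]) -> str:
    """Label a sequence with the class whose HMM scores it highest."""
    return max(models, key=lambda label: log_likelihood(models[label], obs))
```

Because each class model is trained independently, improving the individual models (for instance through better topology selection) can change classification accuracy without any change to this decision rule.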

4.4.1  Algorithms  and  Data  Sets 

As described earlier, STACS uses Baum-Welch for parameter learning and Split-State
Baum-Welch for split design. V-STACS uses Viterbi Training and Split-State Viterbi
Training, followed by Baum-Welch on the final model. We also implemented ML-SSS
(with a tweak for generalizing to non-chain topologies) and Li-Biswas for comparison.

We choose a wide range of real-world datasets that have appeared in previous work
in various contexts, the goal being to examine the ability of our algorithms to uncover
hidden structure in many different, realistic domains. The dimensionality of all datasets
was reduced by PCA to 5 for efficiency, except for the Motionlogger dataset which is 2-D.
The AUSL and Vowel data sets are from the UCI KDD archive [46]. We list the data sets
with (training-set, test-set) sizes below.

Robot: This data set consists of laser-range data provided by the Radish Robotics Data
set Repository [47] gathered by a Pioneer indoor robot traversing multiple runs
of a closed-loop set of corridors in USC's Salvatori Computer Science building.
This data set has appeared in previous work in relation to a robot localization task.
Among the four runs of data provided we used three for training (12,952
observations) and one for testing (4,052 observations).

Mlog: This data set consists of real-valued motion data from two wearable accelerometers
worn by a test subject for a period of several days during daily routine. The training
set and test set contain 10,000 and 4,720 observations respectively. This data set
previously appeared in a paper on large-state-space HMMs [48] and allows us
to compare our test-set likelihoods with previous results.

Mocap: This is a small subset of real-valued motion capture data gathered by several
human subjects instrumented with motion trackers and recorded by cameras while
performing sequences of physical movements. It appears in previous work [49]
in the context of a classification task on natural and unnatural motion sequences;
the best model for the task was found to be an HMM ensemble model and the state
spaces required ranged from 50-60 to 180 states, which makes it a natural candidate
for inclusion here. We use a training set of size 10,028 and test set of size 5,159.

AUSL: This data set consists of high-quality instrumented glove readings of Australian
sign-language words being expressed by an expert signer. The data set contains 27
repetitions each of 95 different words, with each sign consisting of around 50
22-dimensional observations. Here we concatenate signings of 10 different words to
form a training set of size 13,435 and a test set of size 1,771.

Vowel: This data set consists of multiple utterances of a particular Japanese vowel by
nine male speakers. We broke it up into training and test sets of 4,274 and 5,687
datapoints respectively.

Synthetic data: We generated synthetic data sets to examine the ability of our algorithms
to uncover the true number of states.

Table 4.1: Test-set log-likelihoods (scaled by dataset size) and training times of HMMs
learned using STACS, V-STACS, ML-SSS and regular Baum-Welch with 5 random
restarts. The best score and fastest time in each row are highlighted. Li-Biswas had similar
results as ML-SSS, and slower running times, for those m where it completed successfully.

          m = 5                                   m = 40
Data      STACS    V-STACS  ML-SSS   Baum-Welch   STACS    V-STACS  ML-SSS   Baum-Welch
ROBOT     -2.41    -2.41    -2.44    -2.47        -1.75    -1.76    -1.80    -1.78
          40s      13s      70s      99s          11790s   1875s    16048s   18460s
MOCAP     -4.46    -4.46    -4.49    -4.46        -4.32    -4.29    -4.30    -4.37
          34s      14s      49s      65s          5474s    1053s    6430s    7315s
MLOG      -8.78    -8.78    -10.49   -8.78        -8.25    -8.26    -10.49   -8.38
          67s      15s      9s       750s         29965s   8146s    1818s    42250s
AUSL      -3.60    -3.60    -3.60    -3.43        -2.89    -2.77    -3.08    -2.99
          39s      14s      33s      110s         7923s    1550s    8465s    22145s
VOWEL     -4.69    -4.69    -4.68    -4.67        -4.34    -4.32    -4.44    -4.33
          13s      8s       37s      95s          2710s    1011s    2874s    6800s

          m = 20                                  m = 60
Data      STACS    V-STACS  ML-SSS   Baum-Welch   STACS    V-STACS  ML-SSS   Baum-Welch
ROBOT     -1.93    -1.93    -1.98    -1.96        -1.65    -1.64    -1.69    -1.75
          2368s    512s     2804s    4890s        38696s   6086s    51527s   35265s
MOCAP     -4.38    -4.37    -4.33    -4.33        -4.23    -4.26    -4.23    -4.46
          899s     203s     800s     3085s        16889s   3470s    18498s   20950s
MLOG      -8.34    -8.34    -10.49   -8.40        -8.25    -8.23    -8.29    -8.39
          3209s    1173s    284s     12350s       116891s  29379s   108358s  87150s
AUSL      -3.16    -3.18    -3.21    -3.13        -2.71    -2.71    -2.86    -2.89
          1128s    284s     1410s    3655s        23699s   4613s    25156s   60035s
VOWEL     -4.40    -4.41    -4.44    -4.41        -4.30    -4.31    -4.44    -4.31
          548s     189s     1009s    1285s        8296s    2714s    4407s    13360s

4.4.2  Learning  HMMs  of  Predetermined  Size 

We first evaluate performance in learning models of predetermined size. In Table 4.1 we
show test-set log-likelihoods normalized by data set size along with running times for
experiments using STACS and V-STACS along with ML-SSS and regular Baum-Welch
with 5 restarts. Figure 4.4 shows a subset of the same data in a more visually interpretable
form for the m = 40 case. Here we ignore the stopping criterion and perform the best split
at each model selection step until we reach m states. For Baum-Welch, the best score from
its five runs is given along with the total time. Li-Biswas results are not shown because
most desired model sizes were not reached. However, the instances that did successfully
complete indicate that Li-Biswas is much slower than any other method considered, even
Baum-Welch, while learning models with similar scores as ML-SSS.

Table 4.2: BIC scores scaled by dataset size, and (number of states), of final models chosen
by STACS, V-STACS, Li-Biswas and ML-SSS. STACS and V-STACS consistently find
larger models with better BIC scores, indicating more effective split design.

Dataset   STACS        V-STACS      Li-Biswas    ML-SSS
ROBOT     -1.79 (39)   -1.81 (34)   -1.98 (78)   -2.01 (75)
MOCAP     -3.54 (36)   -3.55 (33)   -3.69 (20)   -3.92 (70)
MLOG      -8.44 (14)   -8.45 (20)   -8.59 (77)   -10.51 (7)
AUSL      -2.77 (44)   -2.79 (42)   -2.92 (37)   -3.04 (28)
VOWEL     -4.47 (17)   -4.49 (16)   -4.48 (77)   -4.94 (7)

We note that STACS and V-STACS have the fastest running times for any given HMM
size and data set, except for cases when a competing algorithm got stuck and terminated
prematurely. Figure 4.3(A) shows a typical example of STACS and V-STACS running
times compared to previous methods for different m values.

As m grows larger and the possibility of local minima increases, STACS and V-STACS
consistently return models with better test-set scores. Figure 4.3(B) shows training-set
score against running time for the Robot data for m = 40. This is especially remarkable
for V-STACS which is a purely hard-updates algorithm. One possible explanation is that
V-STACS' coarseness helps it avoid overfitting when splitting states.


4.4.3  Model  Selection  Accuracy  with  BIC 

The final BIC scores and m values of STACS, V-STACS, Li-Biswas and ML-SSS are
shown in Table 4.2 when allowed to stop splitting autonomously. Note that the exact BIC
score was not used by V-STACS, just calculated for comparison. Figure 4.5 presents part
of the data in more visually interpretable form. In all cases, STACS converges on models
with the highest BIC score. For the Mlog data, STACS achieves a better BIC score
even with a smaller model than V-STACS, indicating that the soft-updates method found a
particularly good local optimum.

The consistent superiority of STACS here may seem to contradict results from the
previous section where STACS and V-STACS were seen to be more comparable. A
possible reason is that V-STACS uses the Viterbi path likelihood (which is computed during
hard-updates training anyway) in place of the true likelihood in BIC. This is done to keep
V-STACS as efficient as possible. However the resulting approximate BIC seems to
undervalue good splits, resulting in early stoppage as seen here. We can conclude that, though
the Viterbi approximation works well for state-splitting, the true likelihood is preferable
for model selection purposes when using BIC.
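The gap between the two quantities compared above can be made concrete on a toy discrete HMM. The following is a minimal sketch, not part of the experiments: the function names and the list-based parameterization are assumptions made for illustration, and plain probabilities are used, so it is only suitable for short sequences. Because the Viterbi path likelihood keeps only the single best hidden path while the true likelihood sums over all paths, the former can never exceed the latter.

```python
from typing import List, Sequence


def forward_likelihood(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> float:
    """P(X | model): sum over all hidden state paths (forward algorithm)."""
    m = len(init)
    alpha: List[float] = [init[i] * emit[i][obs[0]] for i in range(m)]
    for x in obs[1:]:
        alpha = [
            sum(alpha[j] * trans[j][i] for j in range(m)) * emit[i][x]
            for i in range(m)
        ]
    return sum(alpha)


def viterbi_likelihood(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> float:
    """max over H of P(X, H | model): probability of the single best path."""
    m = len(init)
    delta: List[float] = [init[i] * emit[i][obs[0]] for i in range(m)]
    for x in obs[1:]:
        delta = [
            max(delta[j] * trans[j][i] for j in range(m)) * emit[i][x]
            for i in range(m)
        ]
    return max(delta)
```

The two values coincide only when a single path carries all the probability mass; the more uncertain the hidden state, the further the Viterbi value falls below the true likelihood.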

4.4.4  Discovering  the  Correct  Topology 

We already saw that STACS is able to learn the correct number of states in the simple
example of Figure 4.1, while Li-Biswas and ML-SSS are not. We generalized this
example to a larger, more difficult instance by generating a 10,000 point synthetic data set
(Figure 4.6(A)) similar to the one in Figure 4.1 but with 10 hidden states with overlapping
Gaussian observations.

Even on this data, both STACS and V-STACS consistently found the true underlying
10-state model whereas Li-Biswas and ML-SSS could not do so. Interestingly, regular
Baum-Welch on a 10-state HMM also failed to find the best configuration of these 10
states even after 50 restarts. This reinforces a notion suggested by results in Section 4.4.2:
even in fixed-size HMM learning, STACS is more effective in avoiding local minima than
multi-restart Baum-Welch.

4.4.5  Australian  Sign-Language  Recognition 

Though improved test-set likelihood is strong evidence of good models, it is also important
to see whether these model improvements translate into superior performance on tasks
such as classification. HMMs play one of their most important roles in the context of
supervised classification and recognition systems, where one HMM is trained for each
distinct sequence class. Classification is carried out by scoring a test sequence with each
HMM, and the sequence is labeled with the class of the highest-scoring HMM.

Table 4.3: Australian sign-language word recognition accuracy on a 95-word classification
task, and average HMM sizes, on AUSL data.

                    STACS    V-STACS   Li-Biswas   ML-SSS
Accuracy            90.9%    95.8%     78.6%       89.5%
Average HMM size    12.5     55        8.3         8.5

One such classification problem is automatic sign-language recognition [50]. We test
the effectiveness of our automatically learned HMMs at classification of Australian sign
language using the AUSL dataset [1]. The data consists of sensor readings from a pair
of Flock instrumented gloves (Figure 4.6(B)), for 27 instances each of 95 distinct words.
Each instance is roughly 55 timesteps. We retained the (x, y, z, roll, pitch, yaw) signals
from each hand resulting in 12-dimensional sequential data. We trained HMMs on an
8:1 split of the data, using STACS, V-STACS, Li-Biswas and ML-SSS. Table 4.3 shows
classification results along with average HMM sizes. V-STACS yields the highest accuracy
along with much larger HMMs than the other algorithms.


4.5  Application:  Event  Detection  in  Unstructured  Audio 

Analysis of unstructured audio scene data has a variety of applications such as:
ethnomusicology, i.e., music classification based on cultural style [51]; audio diarization, i.e.,
extraction of speech segments in long audio signals from background sounds [52]; audio
event detection [53] for audio mining; acoustic surveillance [54], especially for military
and public safety applications, e.g., in urban search and rescue scenarios; and human robot
interaction [55], for voice activated robot actuation and control.

For all these applications, accurate separation of speech from non-speech signals, or
background noise, is a fundamental task that can be effectively solved by applying
various sequence classification algorithms. One very popular and effective classification
scheme [56] is based on HMMs. HMMs can capture the underlying hidden states in the
time-series audio data and also model the transitions in these states to represent the
underlying dynamical system. Traditionally, HMMs for these applications have been learned
by using iterative parameter learning approaches such as Baum-Welch (EM) [4]. While
these approaches have had some success, due to limitations of Baum-Welch they have
also struggled with issues of computational complexity, reliability and scalability. One
reason for this is that Baum-Welch does not aid in the discovery of an appropriate number
of states, which is usually done heuristically or exhaustively. Since new data is obtained
rapidly and the setting can change over time, HMM training has to be highly efficient for
this application, in addition to being accurate. We applied STACS and V-STACS to the
task of learning models for detecting a variety of phenomena from raw audio data over
time. The data was obtained from MobileFusion, Inc., a Pittsburgh-based company which
(at the time this research was undertaken) was developing commercial audio-event
detection software along with an underlying hardware platform. Here we describe the results
of this application.


4.5.1  Data  and  Preprocessing 

Figure 4.7(A) shows the portable sensor platform where real-time audio-event detection
is to be carried out. For our experiments, audio examples were collected using the
device shown in Figure 4.7(B), and were selected with the intent to cover a wide variety of
sounds representative of the respective classes of audio events. For instance, the examples
produced to support Human class models included instances of male, female and children
voices speaking in various languages and environments including offices, coffee shops
and urban outdoors. The Animal class examples included sounds of dogs, owls, wolves,
coyotes, roosters and birds. The class of Ground Vehicles included sample sounds of
motorcycles, cars and trucks passing by. The Aerial Vehicles, our smallest class with just 20
samples, included sounds of jets and helicopters. The Explosions class included sounds of
machine-gun fire, grenade blasts and gunshots. The Background class covers other sounds
such as ocean waves, city traffic, rain, thunder, crowds plus some silence samples. Details
of the data are given in Table 4.4.

Since elaborate, computationally expensive feature extraction methods are not possible
during online audio event detection, we applied minimal preprocessing to the raw sound
recordings in order to prepare data for audio scene experiments. We extracted a set of 13
Mel-frequency cepstral coefficients (which is efficient) followed by assigning class labels
to each of the examples. The cepstral features were then averaged every 5 timesteps to
reduce noise and the resulting values were scaled up by a factor of 10 to avoid numerical
issues in the multivariate Gaussian code.
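The averaging and rescaling step can be sketched as below. This is a hedged illustration only: the function name `average_and_scale` is hypothetical, the windows are assumed to be non-overlapping, and a trailing partial window is assumed to be dropped, none of which is specified in the text. The extraction of the cepstral coefficients themselves is not shown.

```python
from typing import List, Sequence


def average_and_scale(
    frames: Sequence[Sequence[float]], window: int = 5, scale: float = 10.0
) -> List[List[float]]:
    """Average feature vectors over non-overlapping windows, then rescale.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    out: List[List[float]] = []
    for start in range(0, len(frames) - window + 1, window):
        block = frames[start:start + window]
        dim = len(block[0])
        out.append([scale * sum(f[d] for f in block) / window for d in range(dim)])
    return out
```

Averaging shortens each sequence by a factor of five, which also reduces the cost of every subsequent training pass by the same factor.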

Table 4.4: Data Specifications. The last row shows the total number of samples, overall
average number of timesteps per sample, and total duration in minutes and seconds.

Class        %      # samples   Av. len. (T)   Duration
Human        27     128         209            40:11
Animal       17     78          37             4:58
Ground V.    17     79          47             6:57
Aerial V.    4      20          54             2:16
Explosion    18     87          11             1:34
Background   17     83          21             12:58
Total        100%   475         78             68:54

4.5.2  Results 

The classifier was tested using 4-fold cross-validation on the dataset described above.
Figure 4.8 and Table 4.5 summarize the results. We trained HMMs on the data using STACS,
V-STACS and EM (Baum-Welch). For EM, we used the number of states discovered by V-STACS,
which was 56, 26, 26, 18, 14 and 34 on average for the 6 classes respectively. To
help EM escape local minima, we re-initialized it 5 times from different random starting
points. Training was carried out on a 1.8GHz CPU with 4GB RAM. In Figure 4.8(A),
the training times are plotted in log scale of minutes, since the disparity between EM and
our algorithms was so large. For example, for the Background class, V-STACS took 166
minutes, STACS took 243 minutes, and EM took 1023 minutes, even though EM only had to
learn parameters for a fixed HMM size.

We now turn to classification accuracy, focusing on the results of V-STACS vs. EM. STACS
results were similar to those of V-STACS though slightly poorer in some cases. Figure 4.8(B)
shows the accuracies for each of the 6 classes: V-STACS is more accurate than EM except
for the Explosions and Background classes. We believe this is partly due to lack of
sufficient training data (note in Table 4.4 that these classes had the shortest average
sample lengths). Table 4.5 shows the confusion matrices for models trained using EM (top)
and V-STACS (bottom) respectively. For each pair of actual and predicted labels, the better
value of the two tables is listed in bold (higher is better on the diagonal, lower is better
off the diagonal). Due to the highly unstructured, diverse and in some cases insufficient
data, neither algorithm is consistently accurate. However, V-STACS has better results on the
whole, particularly for the Human class, which is of particular interest in this application.
Some of the false negatives and false positives seen in the confusion matrix can be addressed
by using non-uniform priors during classification, to compensate for the large degree of
class imbalance.
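
As a rough illustration of classification with non-uniform priors, the sketch below adds a log-prior to each class's HMM log-likelihood before taking the maximum. The text does not specify which prior would be used; taking the class frequencies of Table 4.4 as the prior is an assumption made here, and the function name `classify` and the example scores are hypothetical.

```python
import math

# Class counts from Table 4.4, used here as empirical priors (an assumption).
CLASS_COUNTS = {"Human": 128, "Animal": 78, "Ground V.": 79,
                "Aerial V.": 20, "Explosion": 87, "Background": 83}

def classify(log_likelihoods, priors=None):
    """Return the class maximizing log p(x | class) + log p(class).

    log_likelihoods: dict mapping class name -> log-likelihood of the sequence
    under that class's HMM. priors=None corresponds to a uniform prior.
    """
    def score(c):
        log_prior = 0.0 if priors is None else math.log(priors[c])
        return log_likelihoods[c] + log_prior
    return max(log_likelihoods, key=score)

total = sum(CLASS_COUNTS.values())
PRIORS = {c: k / total for c, k in CLASS_COUNTS.items()}

# Hypothetical scores for one test clip in which two classes are nearly tied.
scores = {"Human": -100.0, "Aerial V.": -99.5}
print(classify(scores), classify(scores, PRIORS))
```

With a uniform prior the near-tie goes to the slightly higher likelihood; with the frequency-based prior it goes to the more common class. A different prior (for example one that up-weights rare classes) would plug into the same function.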


4.6  Discussion 

Part of the contribution of this work is empirical evidence for the conjecture that better
modeling of state context compensates more than adequately for considering fewer data
points per split in hidden state discovery. In addition, we investigated whether improved
dynamic modeling in split design can also compensate for approximating hidden state
beliefs by less precise, more efficient hard-updates methods via the V-STACS algorithm.
Evaluations show that even V-STACS produces models with higher test-set scores than
soft-updates methods like ML-SSS, Li-Biswas and multi-restart Baum-Welch. The
computational efficiency of our methods can make a big difference in a number of practical
applications such as audio event detection (Section 4.5), where frequent on-the-fly
retraining is often necessary.

Table 4.5: Average Confusion Matrix for EM (top) and V-STACS (bottom). Actual (rows)
vs Predicted (columns). Each entry is a percentage of test data averaged over the
cross-validation runs, and each row sums to 100. For each entry, the better entry of the two
tables is in bold. Ties are in italics.

EM
         H        A        G.V.     A.V.     E        B
H.       92.69    4.68     0        0        0        2.34
A.       17.10    73.68    0        1.31     0        7.89
G.V.     9.21     3.94     69.73    2.63     2.63     11.84
A.V.     5        0        15.00    75       5        0
E.       9.52     1.19     8.33     0        75       5.95
B.       15       8.75     5        0        3.75     67.5

V-STACS
         H        A        G.V.     A.V.     E        B
H.       96.09    2.43     0.78     0        0.78     0
A.       17.10    81.57    0        0        1.31     0
G.V.     9.21     3.94     72.36    0        3.94     10.52
A.V.     5        0        5.00     80       10       0
E.       14.28    2.38     4.76     2.38     72.61    3.57
B.       13.75    11.25    7.5      0        3.75     63.75

An interesting phenomenon observed is that STACS and V-STACS consistently return
larger HMMs (with better BIC scores) than ML-SSS and Li-Biswas when stopping
autonomously. One possible interpretation is that the split design mechanism continues
to find ‘good’ splits even after the ‘easy’ splits are exhausted. An illustration of this
for the Mocap data is in Figure 4.3.D. This makes sense considering that model selection
and parameter optimization are closely related; since parameter search is prone to
local minima, determining the best size of a model depends on being able to find regions
of parameter space where good candidate models reside. Similarly, it is surprising that
V-STACS yielded the highest classification accuracy in the sign-language recognition
task (Section 4.4.5), and did so with much larger final HMMs than any other algorithm.
V-STACS also performed slightly better in the audio event detection task (Section 4.5). More
investigation is needed in this area to see if these two things hold true in other sequence
classification domains, and if so then why.

Previous work for finding the dimensionality of HMMs with discrete-valued observations
[42] and other Bayesian Networks with discrete observations [57] has demonstrated
that hard-updates model selection algorithms can yield much greater efficiency than
soft-updates methods without a large loss of accuracy. To our knowledge, however, this is the
first work that demonstrates hard-updates model selection to be competitive for continuous
observations (Sections 4.4.2, 4.4.5). One possible explanation is that the coarseness
of hard-updates splitting helps avoid overfitting early on which might otherwise trap the
algorithm in a local optimum. It should also be noted that STACS can be generalized to
model selection in Dynamic Bayesian Networks or other directed graphical models that
have hidden states of undetermined cardinality, since the sum-product and max-product
algorithms for inference and learning in these models are generalizations of the
Baum-Welch and Viterbi algorithms for HMMs.

Results from Sections 4.4.2 and 4.4.4 indicate that STACS is also a competitive
fixed-size HMM learning algorithm compared to previous approaches in terms of test-set scores
and efficiency. To our knowledge, this is the first HMM model selection algorithm that
can make this claim. Consequently, there is great potential for applying STACS to domains
where continuous observation-density HMMs of fixed size are used, such as speech
recognition, handwriting recognition, financial modeling and bioinformatics.

In the context of this thesis, STACS provides a more efficient way of learning discrete
LVMs that also have significantly fewer local-minima issues than those learned using
EM. The improved learning algorithm translates to better test-set likelihood and hence
increased predictive power in the resulting HMMs.

However, though the STACS approach to learning HMMs is strictly better than EM
and other existing model selection methods, it does have drawbacks due to its heuristic
components. The greedy binary splitting approach could lead to oversplitting in some
scenarios, or miss other, better partitionings of the state space. The BIC score is only an
approximation to the true posterior, and hence is only an approximate scoring and stopping
criterion. Furthermore, regular BIC is not entirely suitable for temporal models [45]. A
scoring criterion based on variational Bayes [58] might be a better option, though scoring
and stopping based on test-set likelihood, as mentioned earlier, would be best. In Chapter 6
we see a different approach to learning HMMs using matrix decomposition methods.
Unlike STACS and EM-based approaches, this algorithm does not suffer from local minima.
It also simplifies model selection to a simple examination of singular values obtained
from SVD. Such matrix decomposition methods for learning LVMs are more commonly
used in learning continuous LVMs in the form of subspace methods, which we describe
and improve upon in the next chapter.

Figure 4.3: A. Running time vs. number of final states on the Robot data set. B. Final
scaled log-likelihood (nats/datapoint) vs. number of states for learning fixed-size models
on the Robot data. C. Log-Likelihood vs. running time for learning a 40-state model on
the Robot data. D. BIC score vs. running time on the Mocap data when allowed to
stop splitting autonomously. Results are typical of those obtained on all data sets, shown
mostly on the Robot dataset because it allowed the largest (N = 40) Li-Biswas HMMs.


Figure 4.4: Learning 40-state HMMs. Top: scaled test-set log-likelihoods of learned
models. Bottom: Factor of running time speedup w.r.t. slowest algorithm for learning.

Figure 4.5: Learning state space dimensions and parameters. Top: scaled BIC scores of
learned models. Bottom: Final HMM state space size.

Figure 4.6: A. Fragment of 10-state univariate synthetic data sequence used for model
selection testing. The histogram shows only 7 distinct peaks, indicating that some of
the observation densities overlap completely. B. Flock 5DT instrumented glove used to
collect Australian Sign Language data [1] used in our classification experiments.


Figure 4.7: (A) A mobile tactical device and (B) a fixed device on which our algorithms
were deployed.


Figure 4.8: (A) Training times in log(minutes). V-STACS and STACS are at least an order
of magnitude faster than 5-restart EM. (B) Classification accuracies. V-STACS is better
than EM in nearly all cases.


Chapter 5


Learning  Stable  Linear  Dynamical 
Systems 


In this chapter we shift our focus from discrete to continuous latent variable models. We
extend Subspace Identification^1 (Section 3.3.2), a popular alternative to the EM algorithm
for learning LDS parameters which originates in the controls literature. EM is statistically
efficient: it promises to find an optimum point of the observed data likelihood for finite
amounts of training data. On the downside, however, EM is computationally inefficient:
each run finds only a local optimum of this likelihood function, and hence we need many
random re-initializations to search parameter space for the global optimum. Subspace
ID reformulates the search space by trading off a small amount of statistical efficiency
in return for a large increase in computational efficiency. Though Subspace ID does not
reach an optimum point for finite data samples, it promises to reach the global optimum
of the observed data likelihood in the limit of sufficient data. Subspace ID is simpler and
easier to implement than EM as well, consisting of a singular value decomposition (SVD)
of a matrix in contrast to the repeated forward-backward iterations of EM.

An additional difficulty in learning LDSs as opposed to HMMs is that standard learning
algorithms can result in models with unstable dynamics, which causes them to be ill-suited
for several important tasks such as simulation and long-term prediction. This problem can
arise even when the underlying dynamical system emitting the data is stable, particularly
if insufficient training data is available, which is often the case for high-dimensional
temporal sequences. In this chapter we propose an extension to Subspace ID that constrains
the estimated parameters to be stable. Though stability is a non-convex constraint, we
will see how a constraint-generation-based optimization approach yields approximations
to the optimal solution that are more efficient and more accurate than previous
state-of-the-art stabilizing methods.

^1 Though our extension applies to the EM algorithm too, as we describe later.


5.1  Introduction 

We propose an optimization algorithm for learning the dynamics matrix of an LDS while
guaranteeing stability. We first obtain an estimate of the underlying state sequence using
subspace identification. We then formulate the least-squares minimization problem for the
dynamics matrix as a quadratic program (QP) [59], initially without constraints. When we
solve this QP, the estimate $\hat{A}$ we obtain may be unstable. However, any unstable solution
allows us to derive a linear constraint which we then add to our original QP and re-solve.
This constraint is a conservative approximation to the true feasible region. The above
two steps are iterated until we reach a stable solution, which is then refined by a simple
interpolation to obtain the best possible stable estimate. The overall algorithm is illustrated
in Figure 5.1(A).

Our method can be viewed as constraint generation [60] for an underlying convex
program with a feasible set of all matrices with singular values at most 1, similar to work
in control systems such as [2]. This convex set approximates the true, non-convex feasible
region. So, we terminate before reaching feasibility in the convex program, by checking
for matrix stability after each new constraint. This makes our algorithm less conservative
than previous methods for enforcing stability since it chooses the best of a larger set of
stable dynamics matrices. The difference in the resulting stable systems is noticeable
when simulating data. The constraint generation approach also results in much greater
efficiency than previous methods in nearly all cases.

One application of LDSs in computer vision is learning dynamic textures from video
data [61]. An advantage of learning dynamic textures is the ability to play back a
realistic-looking generated sequence of desired duration. In practice, however, videos synthesized
from dynamic texture models can quickly become degenerate because of instability in the
underlying LDS, or because of the competitive inhibition problems discussed in Chapter 1.
In contrast, sequences generated from dynamic textures learned by our method remain
“sane” even after arbitrarily long durations, although we leave the problem of competitive
inhibition to Chapter 6. We also apply our algorithm to learning baseline dynamic models
of over-the-counter (OTC) drug sales for biosurveillance, and sunspot numbers from the
UCR archive [62]. Comparison to the best alternative methods [2, 63] on these problems
yields positive results.


5.2  Related  Work 

Linear  system  identification  is  a  well-studied  subject  [26].  Within  this  area,  subspace 
identification  methods  [27]  have  been  very  successful.  These  techniques  first  estimate 
the  model  dimensionality  and  the  underlying  state  sequence,  and  then  derive  parameter 
estimates  using  least  squares.  Within  subspace  methods,  techniques  have  been  developed 
to  enforce  stability  by  augmenting  the  extended  observability  matrix  with  zeros  [64]  or 
adding  a  regularization  term  to  the  least  squares  objective  [65]. 

All previous methods were outperformed by Lacy and Bernstein [2], henceforth referred
to as LB-1. They formulate the problem as a semidefinite program (SDP) whose
objective minimizes the state sequence reconstruction error, and whose constraint bounds
the largest singular value by 1. This convex constraint is obtained from the nonlinear
matrix inequality $I_n - AA^T \succeq 0$, where $I_n$ is the $n \times n$ identity matrix and
$\succ 0$ ($\succeq 0$) denotes positive (semi-)definiteness. This can be seen as follows: the
inequality bounds the top singular value by 1 since it implies for all vectors $x$

\[
x^T (I_n - AA^T) x \ge 0 \;\Rightarrow\; x^T AA^T x \le x^T x
\]

Therefore this statement holds for $v = v_1(AA^T)$. Define $\lambda = \lambda_1(AA^T)$. Then,

\[
v^T AA^T v \le 1 \;\Rightarrow\; v^T \lambda v \le 1 \;\Rightarrow\; \sigma_1^2(A) \le 1
\]

where the last step follows from the fact that $v^T v = 1$ and $\sigma_1^2(M) = \lambda_1(MM^T)$ for
any square matrix $M$. The existence of this constraint also proves the convexity of the
$\sigma_1 \le 1$ region. This condition is sufficient but not necessary for stability, since a matrix
that violates this condition may still be stable.
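
A small numerical check may make the "sufficient but not necessary" point concrete. The sketch below tests the LB-1 condition by checking whether $I_n - AA^T$ is positive semidefinite and compares it with the spectral radius. The helper names and the example matrix are illustrative choices, not taken from [2].

```python
import numpy as np

def satisfies_lb1(A, tol=1e-12):
    """True if I_n - A A^T is positive semidefinite (up to tol)."""
    n = A.shape[0]
    return bool(np.linalg.eigvalsh(np.eye(n) - A @ A.T).min() >= -tol)

def top_singular_value(A):
    return float(np.linalg.svd(A, compute_uv=False)[0])

def spectral_radius(A):
    return float(np.abs(np.linalg.eigvals(A)).max())

# Illustrative matrix: both eigenvalues are 0.5, yet its top singular value exceeds 1.
A = np.array([[0.5, 0.9],
              [0.0, 0.5]])
print(satisfies_lb1(A), top_singular_value(A), spectral_radius(A))
```

This matrix is stable but violates the LB-1 condition, so an optimizer restricted to $\sigma_1 \le 1$ could never return it.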

A follow-up to this work by the same authors [63], which we will call LB-2, attempts
to overcome the conservativeness of LB-1 by approximating the Lyapunov inequalities
$P - APA^T \succ 0$, $P \succ 0$ with the inequalities $P - APA^T - \delta I_n \succeq 0$,
$P - \delta I_n \succeq 0$, $\delta > 0$. The Lyapunov inequalities admit a solution $P$ iff the
spectral radius of $A$ is less than 1.^2 However, the approximation
is achieved only at the cost of inducing a nonlinear distortion of the objective function by
a problem-dependent reweighting matrix involving $P$, which is a variable to be optimized.
In our experiments, this causes LB-2 to perform worse than LB-1 (for any $\delta$) in terms of
the state sequence reconstruction error, even while obtaining solutions outside the feasible
region of LB-1. Consequently, we focus on LB-1 in our conceptual and qualitative
comparisons as it is the strongest baseline available. However, LB-2 is more scalable than
LB-1, so quantitative results are presented for both.

To summarize the distinction between LB-1 and LB-2: it is hard to have both the right
objective function (reconstruction error) and the right feasible region (the set of stable
matrices). LB-1 optimizes the right objective but over the wrong feasible region (the set
of matrices with $\sigma_1 \le 1$). LB-2 has a feasible region close to the right one, but at the cost
of distorting its objective function to an extent that it fares worse than LB-1 in nearly all
cases.

^2 For a proof sketch, see [32] pg. 410.


5.3 The Algorithm


The overall algorithm we propose is quite simple. We first formulate the dynamics
matrix learning problem as a QP with a feasible set that includes the set of stable dynamics
matrices. Then, if the unconstrained solution is unstable, we demonstrate how unstable
solutions can be used to generate linear constraints that are added to restrict the feasible
set of the QP appropriately. The QP is then re-solved and the constraint generation loop
is repeated until we reach a stable solution. As a final step, the solution is refined to be as
close as possible to the unconstrained-objective-minimizing estimate while remaining
stable. The overall algorithm is illustrated in Figure 5.1(A). Note that the linear constraints
can eliminate subsets of the set of stable matrices from the solution space, so the final
solution is not necessarily the optimal one. However, with respect to LB-1 and LB-2, our
method optimizes the right objective (unlike LB-2) over a less conservative feasible region
which includes some stable matrices with $\sigma_1 > 1$ (unlike LB-1). Optimizing over the right
feasible region (spectral radius $\le 1$) is hard, for reasons we will see in Section 5.3.2.

We  now  elaborate  on  the  different  steps  of  the  algorithm,  namely  how  to  formulate  the 
objective  (Section  5.3.1),  generate  constraints  (Section  5.3.3),  compute  a  stable  solution 
(Section  5.3.4)  and  then  refine  it  (Section  5.3.5). 


5.3.1  Formulating  the  Objective 


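Assume the notation and formulations of Chapter 3. In subspace ID as well as in the
M-step of an iteration of EM, it is possible to write the objective function for $A$ as a quadratic
function. For subspace ID we define a quadratic objective function:

\[
\begin{aligned}
\hat{A} &= \arg\min_A \left\| A X_i - X_{i+1} \right\|_F^2 \\
        &= \arg\min_A \left\{ \mathrm{tr}\left[ (A X_i - X_{i+1})^T (A X_i - X_{i+1}) \right] \right\} \\
        &= \arg\min_A \left\{ \mathrm{tr}(A X_i X_i^T A^T) - 2\,\mathrm{tr}(X_i X_{i+1}^T A) + \mathrm{tr}(X_{i+1}^T X_{i+1}) \right\} \\
        &= \arg\min_a \left\{ a^T P a - 2\, q^T a + r \right\}
\end{aligned}
\tag{5.1a}
\]

where $a \in \mathbb{R}^{n^2 \times 1}$, $q \in \mathbb{R}^{n^2 \times 1}$, $P \in \mathbb{R}^{n^2 \times n^2}$
and $r \in \mathbb{R}$ are defined as:

\[ a = \mathrm{vec}(A) = [A_{11}\; A_{21}\; A_{31}\; \cdots\; A_{nn}]^T \tag{5.1b} \]
\[ P = I_n \otimes (X_i X_i^T) \tag{5.1c} \]
\[ q = \mathrm{vec}(X_i X_{i+1}^T) \tag{5.1d} \]
\[ r = \mathrm{tr}(X_{i+1}^T X_{i+1}) \tag{5.1e} \]

$I_n$ is the $n \times n$ identity matrix and $\otimes$ denotes the Kronecker product. Note that $P$ is a
symmetric positive semi-definite matrix and the objective function in Equation (5.1a) is a
quadratic function of $a$. For EM, we can use a similar quadratic objective function:

\[ \hat{A} = \arg\min_a \left\{ a^T P a - 2\, q^T a \right\} \tag{5.2a} \]

where $a \in \mathbb{R}^{n^2 \times 1}$, $q \in \mathbb{R}^{n^2 \times 1}$ and $P \in \mathbb{R}^{n^2 \times n^2}$
are defined as:

\[ a = \mathrm{vec}(A) = [A_{11}\; A_{21}\; A_{31}\; \cdots\; A_{nn}]^T \tag{5.2b} \]
\[ P = I_n \otimes \left( \sum_{t=2}^{T} P_t \right) \tag{5.2c} \]
\[ q = \mathrm{vec}\left( \sum_{t=2}^{T} P_{t-1,t} \right) \tag{5.2d} \]

Here, $P_t$ and $P_{t-1,t}$ are taken directly from the E-step of EM.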
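
The quadratic form for subspace ID can be checked numerically. The sketch below builds $P$, $q$ and $r$ for random placeholder data and compares $a^T P a - 2 q^T a + r$ with the Frobenius objective computed directly. It relies on one assumption worth stating: NumPy's default row-major `reshape` stacks the rows of $A$ into $a$, and under that ordering the block-diagonal $P = I_n \otimes XX^T$ with $q$ taken as the column-stacked $XY^T$ reproduces the objective exactly. The names `quadratic_terms`, `X` and `Y` (standing in for $X_i$ and $X_{i+1}$) are illustrative.

```python
import numpy as np

def quadratic_terms(X, Y):
    """P, q, r with ||A X - Y||_F^2 = a^T P a - 2 q^T a + r, a = rows of A stacked."""
    n = X.shape[0]
    P = np.kron(np.eye(n), X @ X.T)          # I_n (x) X X^T, block diagonal
    q = (X @ Y.T).reshape(-1, order="F")     # column-stacked X Y^T
    r = float(np.sum(Y * Y))                 # tr(Y^T Y)
    return P, q, r

rng = np.random.default_rng(0)
n, tau = 3, 50
X = rng.standard_normal((n, tau))    # placeholder for the state estimates X_i
Y = rng.standard_normal((n, tau))    # placeholder for the shifted states X_{i+1}
P, q, r = quadratic_terms(X, Y)

A = rng.standard_normal((n, n))      # arbitrary candidate dynamics matrix
a = A.reshape(-1)                    # row-major stacking of A
direct = np.linalg.norm(A @ X - Y, "fro") ** 2
quadratic = a @ P @ a - 2 * q @ a + r

# Unconstrained minimizer: setting the gradient 2 P a - 2 q to zero gives P a = q.
A_hat = np.linalg.solve(P, q).reshape(n, n)
print(direct, quadratic)
```

Solving $Pa = q$ recovers the ordinary least-squares estimate, which is the unconstrained starting point of the algorithm.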


5.3.2  Convexity 

Figure 5.1: (A): Conceptual depiction of the space of $n \times n$ matrices. The region of stability
($S_\lambda$) is non-convex while the smaller region of matrices with $\sigma_1 \le 1$ ($S_\sigma$) is convex. The
elliptical contours indicate level sets of the quadratic objective function of the QP. $\hat{A}$ is
the unconstrained least-squares solution to this objective. $A_{LB-1}$ is the solution found by
LB-1 [2]. One iteration of constraint generation yields the constraint indicated by the
line labeled ‘generated constraint’, and (in this case) leads to a stable solution $A^*$. The
final step of our algorithm improves on this solution by interpolating $A^*$ with the previous
solution (in this case, $\hat{A}$) to obtain $A^*_{final}$. (B): The actual stable and unstable regions for
the space of $2 \times 2$ matrices $E_{\alpha,\beta} = [\,0.3 \;\; \alpha \,;\, \beta \;\; 0.3\,]$, with
$\alpha, \beta \in [-10, 10]$. Constraint generation is able to learn a nearly optimal model from a
noisy state sequence of length 7 simulated from $E_{0,10}$, with better state reconstruction error
than either LB-1 or LB-2.

The feasible set of the quadratic objective function is the space of all $n \times n$ matrices,
regardless of their stability. When its solution yields an unstable matrix, the spectral radius
of $\hat{A}$ is greater than 1. Ideally we want to constrain the solution space to the set of stable
matrices, i.e. the set of matrices with spectral radius at most one (which we call $S_\lambda$).
However, it is not possible to formulate a convex optimization routine that optimizes over
this set because of the shape of $S_\lambda$. Consider the class of $2 \times 2$ matrices [66]:

\[
E_{\alpha,\beta} = \begin{bmatrix} 0.3 & \alpha \\ \beta & 0.3 \end{bmatrix}
\]

The matrices $E_{10,0}$ and $E_{0,10}$ are stable with $\lambda_1 = 0.3$, but their convex
combination $\gamma E_{10,0} + (1 - \gamma) E_{0,10}$ is unstable for (e.g.) $\gamma = 0.5$ (Figure 5.1(B)). This
shows that the set of stable matrices is non-convex for $n = 2$, and in fact this is true
for all $n > 1$. Optimizing over this set is hence a difficult non-convex problem. We turn
instead to the largest singular value, which is a closely related quantity since

\[
\sigma_{\min}(A) \le |\lambda_i(A)| \le \sigma_{\max}(A) \quad \forall i = 1, \ldots, n \quad [32]
\]

Therefore every unstable matrix has a singular value greater than one, but the converse is
not necessarily true. Moreover, the set of matrices with $\sigma_1 \le 1$ is convex. To see this, note
that for any square matrix $M$,

\[
\sigma_1(M) = \max_{u,v \,:\, \|u\|_2 = 1,\, \|v\|_2 = 1} u^T M v.
\]

Therefore, if $\sigma_1(M_1) \le 1$ and $\sigma_1(M_2) \le 1$, then for all convex combinations,

\[
\sigma_1(\gamma M_1 + (1 - \gamma) M_2) = \max_{u,v \,:\, \|u\|_2 = 1,\, \|v\|_2 = 1} \gamma\, u^T M_1 v + (1 - \gamma)\, u^T M_2 v \le 1.
\]

Figure 5.1(A) conceptually depicts the non-convex region of stability $S_\lambda$ and the convex
region $S_\sigma$ with $\sigma_1 \le 1$ in the space of all $n \times n$ matrices for some fixed $n$. The difference
between $S_\sigma$ and $S_\lambda$ can be significant. Figure 5.1(B) depicts these regions for $E_{\alpha,\beta}$ with
$\alpha, \beta \in [-10, 10]$. The stable matrices $E_{10,0}$ and $E_{0,10}$ reside at the edges of the figure. Our
algorithm is designed to mitigate the difference between $S_\lambda$ and $S_\sigma$ by stopping before it
reaches $S_\sigma$. While one might worry that the difference is too severe to mitigate this way,
and results do vary based on the instance used, our experiments will show that the
constraint generation algorithm described below is able to learn a nearly optimal model
from a noisy state sequence of length $\tau = 7$ simulated from $E_{0,10}$, with better state
reconstruction error than LB-1 and LB-2.
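
The non-convexity example is easy to verify numerically. The following sketch evaluates the spectral radius and top singular value of the two triangular matrices and of their midpoint; the helper names are illustrative.

```python
import numpy as np

def E(alpha, beta):
    """The 2 x 2 family with 0.3 on the diagonal and (alpha, beta) off the diagonal."""
    return np.array([[0.3, alpha], [beta, 0.3]], dtype=float)

def spectral_radius(A):
    return float(np.abs(np.linalg.eigvals(A)).max())

def top_singular_value(A):
    return float(np.linalg.svd(A, compute_uv=False)[0])

gamma = 0.5
mix = gamma * E(10, 0) + (1 - gamma) * E(0, 10)   # equals E(5, 5)

for name, M in [("E(10,0)", E(10, 0)), ("E(0,10)", E(0, 10)), ("mix", mix)]:
    print(name, spectral_radius(M), top_singular_value(M))
```

Both endpoints have spectral radius 0.3 while the midpoint has eigenvalues 0.3 ± 5, so a segment joining two stable matrices leaves the stable set. The endpoints also have top singular value near 10, which shows how far $S_\lambda$ can extend beyond $S_\sigma$.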

5.3.3  Generating  Constraints 

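We now describe how to generate convex constraints on the set $S_\sigma$. The basic idea is to
use the unstable solution $\hat{A}$ along with properties of matrices in the set $S_\sigma$ to infer a linear
constraint between $\hat{A}$ and $S_\sigma$. Assume that

\[
\hat{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^T
\]

by SVD, where $\tilde{U} = [\tilde{u}_i]_{i=1}^n$, $\tilde{V} = [\tilde{v}_i]_{i=1}^n$ and
$\tilde{\Sigma} = \mathrm{diag}\{\tilde{\sigma}_1, \ldots, \tilde{\sigma}_n\}$. Then:

\[
\hat{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^T \;\Rightarrow\; \tilde{\Sigma} = \tilde{U}^T \hat{A} \tilde{V}
\;\Rightarrow\; \tilde{\sigma}_1(\hat{A}) = \tilde{u}_1^T \hat{A} \tilde{v}_1 = \mathrm{tr}(\tilde{u}_1^T \hat{A} \tilde{v}_1)
\tag{5.3}
\]

Therefore, instability of $\hat{A}$ implies that:

\[
\tilde{\sigma}_1 > 1 \;\Rightarrow\; \mathrm{tr}(\tilde{u}_1^T \hat{A} \tilde{v}_1) > 1
\;\Rightarrow\; \mathrm{tr}(\tilde{v}_1 \tilde{u}_1^T \hat{A}) > 1 \;\Rightarrow\; g^T \hat{a} > 1
\tag{5.4}
\]

Here $g = \mathrm{vec}(\tilde{u}_1 \tilde{v}_1^T)$. Since equation (5.4) arises from an unstable solution of
equation (5.1a), $g$ can be interpreted as a hyperplane separating $\hat{a}$ from the space of matrices
with $\sigma_1 \le 1$. In fact, the hyperplane is tangent to $S_\sigma$. We use the negation of equation (5.4)
as a constraint:

\[
g^T a \le 1
\tag{5.5}
\]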
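
As a hedged illustration of Eqs. (5.3) to (5.5), the sketch below computes $g$ from the top singular vectors of an unstable matrix and checks two properties: $g^T \hat{a}$ equals the top singular value, and a matrix scaled to have $\sigma_1 = 1$ satisfies $g^T a \le 1$. The matrix is stacked row by row, and $\tilde{u}_1 \tilde{v}_1^T$ is stacked the same way; any ordering works as long as both use it. The function name and example matrices are illustrative.

```python
import numpy as np

def stability_constraint(A_hat):
    """Return (g, sigma_1) with g . vec(A_hat) = sigma_1; the constraint is g . a <= 1."""
    U, s, Vt = np.linalg.svd(A_hat)
    g = np.outer(U[:, 0], Vt[0]).reshape(-1)   # u_1 v_1^T, stacked like a
    return g, float(s[0])

A_hat = np.array([[0.3, 5.0],
                  [5.0, 0.3]])                 # unstable: spectral radius 5.3
g, sigma1 = stability_constraint(A_hat)
a_hat = A_hat.reshape(-1)
print(g @ a_hat, sigma1)                       # these agree and exceed 1

# A matrix rescaled to sigma_1 = 1 satisfies the generated constraint.
rng = np.random.default_rng(0)
M = rng.standard_normal((2, 2))
M /= np.linalg.svd(M, compute_uv=False)[0]
print(g @ M.reshape(-1) <= 1.0 + 1e-12)
```

The second check reflects the fact that $u^T M v \le \sigma_1(M)$ for unit vectors, so the half-space never removes any part of $S_\sigma$.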


5.3.4  Computing  the  Solution 

Given the above mechanism for generating convex constraints on the set $S_\sigma$, a constraint
generation-based convex optimization algorithm immediately suggests itself. The overall
quadratic program can be stated as:

\[
\begin{aligned}
\text{minimize} \quad & a^T P a - 2\, q^T a + r \\
\text{subject to} \quad & G a \le h
\end{aligned}
\tag{5.6}
\]

with $a$, $P$, $q$ and $r$ as defined in Eqs. (5.1b) to (5.1e). $\{G, h\}$ define the set of constraints,
and are initially empty. The QP is invoked repeatedly until the stable region, i.e. $S_\lambda$, is reached.
At each iteration, we calculate a linear constraint of the form in Eq. (5.5), add the
corresponding $g^T$ as a row in $G$, and augment the $h$ vector with a 1 at the end. Note that we will
almost always stop before reaching the feasible region $S_\sigma$.
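
The loop can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the QP is solved with Hildreth's dual coordinate-ascent procedure as a simple stand-in for a general QP solver, $P$ is assumed positive definite, and the function names, iteration limits and tolerance are hypothetical choices. The matrix is stacked row by row, matching `reshape(-1)`.

```python
import numpy as np

def spectral_radius(A):
    return float(np.abs(np.linalg.eigvals(A)).max())

def solve_qp(P, q, G, h, sweeps=500):
    """Minimize a^T P a - 2 q^T a subject to G a <= h, for positive definite P.

    Hildreth's dual coordinate ascent; a stand-in for a production QP solver.
    """
    a0 = np.linalg.solve(P, q)                 # unconstrained minimizer
    if len(h) == 0:
        return a0
    HinvGt = np.linalg.solve(2.0 * P, G.T)     # (2P)^{-1} G^T
    M = G @ HinvGt
    k = h - G @ a0                             # negative where a0 violates a constraint
    mu = np.zeros(len(h))                      # Lagrange multipliers, kept >= 0
    for _ in range(sweeps):
        for i in range(len(h)):
            mu[i] = max(0.0, mu[i] - (k[i] + M[i] @ mu) / M[i, i])
    return a0 - HinvGt @ mu

def learn_stable_dynamics(X, Y, max_iter=50, tol=1e-8):
    """Constraint generation for min ||A X - Y||_F^2, stopped at the first stable A.

    Returns (stable A, previous unstable A or None, number of constraints added).
    """
    n = X.shape[0]
    P = np.kron(np.eye(n), X @ X.T)
    q = (Y @ X.T).reshape(-1)                  # matches row-major a = A.reshape(-1)
    G, h = np.zeros((0, n * n)), np.zeros(0)   # constraint set starts empty
    previous = None
    for _ in range(max_iter):
        A = solve_qp(P, q, G, h).reshape(n, n)
        if spectral_radius(A) <= 1.0 + tol:    # stop as soon as A is stable
            return A, previous, len(h)
        U, s, Vt = np.linalg.svd(A)
        g = np.outer(U[:, 0], Vt[0]).reshape(-1)   # new row of G, as in Eq. (5.5)
        G, h = np.vstack([G, g]), np.append(h, 1.0)
        previous = A
    raise RuntimeError("no stable solution found within max_iter constraints")
```

For example, `learn_stable_dynamics(np.eye(2), np.diag([1.2, 0.5]))` starts from the unstable least-squares solution diag(1.2, 0.5), adds a single constraint, and returns a matrix close to diag(1.0, 0.5). Returning the previous unstable iterate as well makes it available to the refinement step.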

5.3.5  Refinement 

Once a stable matrix is obtained, it is possible to refine this solution. We know that the
last constraint caused our solution to cross the boundary of $S_\lambda$, so we interpolate between
the last solution and the previous iteration’s solution using binary search to look for a
boundary of the stable region, in order to obtain a better objective value while remaining
stable. This results in a stable matrix whose top eigenvalue has magnitude equal to 1. In
principle, we could attempt to interpolate between any stable solution and any one of the
unstable solutions from previous iterations. However, the stable region can be highly complex, and
there may be several folds and boundaries of the stable region in the interpolated area. In
our experiments (not shown), interpolating from the Lacy-Bernstein solution to the last
unstable solution yielded worse results. We also tried other interpolation and
constraint-relaxation methods such as: interpolating from the least squares solution to the first stable
solution, dropping constraints added earlier in the constraint-generation process, and
expanding the constrained set by multiplying the $h$ vector by a constant greater than one. All
these methods yielded worse results overall than the algorithm presented here.
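
The interpolation step can be sketched as a bisection on the mixing weight between the last stable solution and the previous unstable one. The sketch assumes the segment crosses the stability boundary only once near the point found, which, as noted above, need not hold in general. The function name and iteration count are hypothetical.

```python
import numpy as np

def spectral_radius(A):
    return float(np.abs(np.linalg.eigvals(A)).max())

def refine(A_stable, A_unstable, iters=50):
    """Bisect along the segment from A_stable to A_unstable and return the
    stable point (spectral radius <= 1) closest to A_unstable that was found."""
    lo, hi = 0.0, 1.0              # lo: weight known to be stable, hi: known unstable
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        candidate = (1 - mid) * A_stable + mid * A_unstable
        if spectral_radius(candidate) <= 1.0:
            lo = mid
        else:
            hi = mid
    return (1 - lo) * A_stable + lo * A_unstable

# Illustrative pair: the boundary lies halfway between the two matrices.
refined = refine(np.diag([0.8, 0.5]), np.diag([1.2, 0.5]))
print(spectral_radius(refined))
```

Because the returned matrix is always one that passed the stability test, the result stays stable even when the bisection has not fully converged.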


5.4  Experiments 

For learning the dynamics matrix, we implemented EM, subspace identification, constraint
generation (using quadprog), LB-1 [2] and LB-2 [63] (using CVX with SeDuMi) in Matlab
on a 3.2 GHz Pentium with 2 GB RAM. Note that the algorithms that constrain the
solution to be stable give a different result from the basic EM and subspace ID algorithms
only in situations when the unconstrained $\hat{A}$ is unstable. However, LDSs learned
in scarce-data scenarios are unstable for almost any domain, and some domains lead to
unstable models up to the limit of available data (e.g. the steam dynamic textures in
Section 5.4.1). The goals of our experiments are to: (1) examine the state evolution and
simulated observations of models learned using constraint generation, and compare them
to previous work on learning stable dynamical systems; and (2) compare the algorithms
in terms of computational efficiency. We apply these algorithms to learning dynamic
textures from the vision domain (Section 5.4.1), modeling over-the-counter (OTC) drug sales
counts (Section 5.4.3) and sunspot numbers (Section 5.4.4).

5.4.1  Stable  Dynamic  Textures 

Dynamic textures in vision can intuitively be described as models for sequences of images
that exhibit some form of low-dimensional structure and recurrent (though not necessarily
Figure 5.2: Dynamic textures. A. Samples from the original steam sequence and the
fountain sequence. B. State evolution of synthesized sequences over 1000 frames
(steam top, fountain bottom). The least squares solutions display instability as time
progresses. The solutions obtained using LB-1 remain stable for the full 1000 frame
image sequence. The constraint generation solutions, however, yield state sequences that
are stable over the full 1000 frame image sequence without significant damping. C.
Samples drawn from a least squares synthesized sequence (top), and samples drawn from a
constraint generation synthesized sequence (bottom). The constraint generation
synthesized steam sequence is qualitatively better looking than the steam sequence
generated by LB-1, although there is little qualitative difference between the two synthesized
fountain sequences.


repeating)  characteristics,  e.g.  fixed-background  videos  of  rising  smoke  or  flowing  water. 
Treating  each  frame  of  a  video  as  an  observation  vector  of  pixel  values  yt,  we  learned 
          steam                                   fountain
          CG        LB-1      LB-1*     LB-2      CG        LB-1      LB-1*     LB-2
n = 10
|λ1|      1.000     0.993     0.993     1.000     0.999     0.987     0.987     0.997
σ1        1.036     1.000     1.000     1.034     1.051     1.000     1.000     1.054
e_x (%)   45.2      103.3     103.3     546.9     0.1       4.1       4.1       3.0
time      0.45      95.87     3.77      0.50      0.15      15.43     1.09      0.49
n = 20
|λ1|      0.999     -         0.990     0.999     0.999     -         0.988     0.996
σ1        1.037     -         1.000     1.062     1.054     -         1.000     1.056
e_x (%)   58.4      -         154.7     294.8     1.2       -         5.0       22.3
time      2.37      -         1259.6    33.55     1.63      -         159.85    5.13
n = 30
|λ1|      1.000     -         0.988     1.000     1.000     -         0.993     0.998
σ1        1.054     -         1.000     1.130     1.030     -         1.000     1.179
e_x (%)   63.0      -         341.3     631.5     13.3      -         14.9      104.8
time      8.72      -         23978.9   62.44     12.68     -         5038.94   48.55
n = 40
|λ1|      1.000     -         0.989     1.000     1.000     -         0.991     1.000
σ1        1.120     -         1.000     1.128     1.034     -         1.000     1.172
e_x (%)   20.24     -         282.7     768.5     3.3       -         4.8       21.5
time      5.85      -         79516.98  289.79    61.9      -         43457.77  239.53

Table 5.1: Quantitative results on the dynamic textures data for different numbers of states
n. CG is our algorithm, LB-1 and LB-2 are competing algorithms, and LB-1* is a
simulation of LB-1 using our algorithm by generating constraints until we reach $S_\sigma$, since LB-1
failed for n > 10 due to memory limits. $e_x$ is percent difference in squared reconstruction
error. Constraint generation, in all cases, has lower error and faster runtime.
Figure 5.3: Bar graphs illustrating decreases in objective function value relative to the
least squares solution (A,B) and the running times (C,D) for different stable LDS learning
algorithms on the fountain (A,C) and steam (B,D) textures respectively, based on the
corresponding columns of Table 5.1.


dynamic  texture  models  of  two  video  sequences:  the  steam  sequence,  composed  of  120  x 
170  pixel  images,  and  the  fountain  sequence,  composed  of  150  x  90  pixel  images, 
both  of  which  originated  from  the  MIT  temporal  texture  database  (Figure  5.2(A)).  We  use 
parameters τ = 80, n = 15, and d = 10. Note that, while the observations are the raw
pixel  values,  the  underlying  state  sequence  we  learn  has  no  a  priori  interpretation. 

An LDS model of a dynamic texture may synthesize an "infinitely" long sequence of
images by driving the model with zero mean Gaussian noise. Each of our two models uses
an 80 frame training sequence to generate 1000 sequential images in this way. To better
visualize the difference between image sequences generated by least-squares, LB-1, and
constraint generation, the evolution of each method's state is plotted over the course of the
synthesized sequences (Figure 5.2(B)). Sequences generated by the least squares models
appear to be unstable, and this was in fact the case; both the steam and the fountain
sequences resulted in unstable dynamics matrices. Conversely, the constrained subspace
identification algorithms all produced well-behaved sequences of states and stable
dynamics matrices (Table 5.1), although constraint generation demonstrates the fastest runtime,
best scalability, and lowest error of any stability-enforcing approach.
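
The qualitative difference between stable and unstable synthesis can be reproduced with a small simulation. The sketch below is an illustration in Python/NumPy under assumed settings, not the experiment itself: the 2-state scaled rotations, the noise level, and the helper names `simulate_states` and `scaled_rotation` are all hypothetical. It drives $x_{t+1} = A x_t + w_t$ with zero-mean Gaussian noise for 1000 steps and records the state norm.

```python
import numpy as np

def simulate_states(A, x0, steps, noise_std, rng):
    """Roll x_{t+1} = A x_t + w_t with w_t ~ N(0, noise_std^2 I); return state norms."""
    x = np.asarray(x0, dtype=float)
    norms = np.empty(steps)
    for t in range(steps):
        x = A @ x + noise_std * rng.standard_normal(x.shape)
        norms[t] = np.linalg.norm(x)
    return norms

def scaled_rotation(scale, theta=0.1):
    """2 x 2 rotation by theta, scaled so both eigenvalues have magnitude `scale`."""
    c, s = np.cos(theta), np.sin(theta)
    return scale * np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
# Eigenvalue magnitude slightly above one: the state blows up over 1000 steps.
norms_unstable = simulate_states(scaled_rotation(1.01), [1.0, 0.0], 1000, 0.01, rng)
# Eigenvalue magnitude slightly below one: the state stays bounded.
norms_stable = simulate_states(scaled_rotation(0.999), [1.0, 0.0], 1000, 0.01, rng)
```

Even an eigenvalue magnitude of 1.01 compounds to a growth factor of roughly $1.01^{1000} \approx 2 \times 10^4$ over 1000 steps, which is consistent with instability only becoming visible well into a long synthesized sequence.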

A qualitative comparison of images generated by constraint generation and least squares
(Figure 5.2(C)) indicates the effect of instability in synthesized sequences generated from
dynamic texture models. While the unstable least-squares model demonstrates a dramatic
and unrealistic increase in image contrast over time, the constraint generation model
continues to generate qualitatively reasonable images. Qualitative comparisons between
constraint generation and LB-1 indicate that constraint generation learns models that generate
more natural-looking video sequences^ than LB-1.

Table 5.1 demonstrates that constraint generation always has the lowest error as well as
the fastest runtime. The running time of constraint generation depends on the number of
constraints needed to reach a stable solution. Note that LB-1 is more efficient and scalable
when simulated using constraint generation (by adding constraints until $S_\sigma$ is reached)
than it is in its original SDP formulation.
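
Table 5.1 reports two diagnostics for each learned dynamics matrix: the largest eigenvalue magnitude and the largest singular value. The LB-1 columns always show a top singular value of 1.000, while the constraint generation columns show values above one together with an eigenvalue magnitude of at most one. The sketch below (Python/NumPy; the matrix is an illustrative choice and not one of the learned models) shows why the two can disagree: the spectral radius never exceeds the top singular value, so bounding the singular value is a more conservative condition than stability itself.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def top_singular_value(A):
    """Largest singular value of A."""
    return float(np.linalg.svd(A, compute_uv=False)[0])

# Illustrative matrix (not from the experiments): stable, yet sigma_1 > 1.
A = np.array([[0.3, 2.0],
              [0.0, 0.3]])

rho = spectral_radius(A)        # both eigenvalues equal 0.3
sigma1 = top_singular_value(A)  # roughly 2.04

# Powers of A still decay to zero because rho < 1.
norm_A50 = float(np.linalg.norm(np.linalg.matrix_power(A, 50)))
```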

Figure 5.3 shows bar graphs comparing reconstruction errors and running times of
these algorithms, based on columns of Table 5.1, illustrating the large difference in
efficiency and accuracy between constraint generation and competing methods.


5.4.2  Prediction  Accuracy  on  Robot  Sensor  Data 

Another important measure of accuracy of dynamic models is their performance on
short-term and long-term prediction. This problem was addressed in the context of stable LDS
modeling in Byron Boots' CMU Master's thesis [3], which uses the techniques described

^See  videos  at  http://www.select.cs.cmu.edu/projects/stableLDS 
Figure 5.4: Prediction accuracy on Robot Sensory Data (from Boots (2009) [3]). A. The
mobile robot with camera and laser sensors. B. The environment and robot path. C. An
image from the robot camera. D. A depiction of laser range scan data (green dots on
environment surfaces). E. Predictive log-likelihoods from: the unstable model (Unstable),
constraint generation (CG), and the two other LDS stabilizing algorithms, LB-1 and LB-2.
Bars at every 20 timesteps denote variance in the results. CG provides the best stable short
term predictions, nearly mirroring the unstable model, while all three stabilized models do
better than the unstable model in the long term.


in this chapter for models with exogenous control inputs, and also combines the constraint
generation algorithm with EM (Section 3.3.1) in the way indicated earlier in this chapter.
The experiment and its results are summarized in this section since the original document is
difficult to access outside CMU. Vision data and laser range scans were collected from a
Point Grey Bumblebee2 stereo camera and a SICK laser rangefinder mounted on a Botrics
O-bot d100 mobile robot platform (Figure 5.4(A)) circling an obstacle in an indoor
environment (Figure 5.4(B)). After collecting video and laser data (Figure 5.4(C,D)), the pixel
and range reading data were vectorized at each timestep and concatenated. After centering
and scaling to align the variances, and SVD to reduce dimensionality, a sequence of 2000
10-dimensional processed observations was obtained.

For the experiment, 15 sequences of 200 frames were used to learn models of the
environment via EM initialized by subspace ID; of these models 10 were unstable. The
unstable models were stabilized using constraint generation, LB-1 and LB-2. Stabilization
was performed in the last M-step of EM (doing it in every EM iteration is more expensive
and did not improve performance). To test predictive power, filtering was performed for 20
frames using the original model, and from there on, prediction was performed for different
extents ranging from 1 to 500, from the original unstable model and the three stabilized
models. This filtering and prediction was repeated on 1500 separate subsequences.

The resulting plot of prediction log-likelihood over different prediction extents
(Figure 5.4(E)) shows that, among the stabilized models, the one obtained using constraint
generation has the highest prediction log-likelihood in the short term (1-120 timesteps),
nearly matching that of the unstable model. In the long term all the stable models have a
higher prediction log-likelihood than the unstable model, whose score declines steadily
while the stable models' scores remain consistently higher.

5.4.3  Stable  Baseline  Models  for  Biosurveillance 

We examine daily counts of OTC drug sales in pharmacies, obtained from the National
Retail Data Monitor (NRDM) collection [67]. The counts are divided into 23 different
categories and are tracked separately for each zipcode in the country. We focus on zipcodes
from a particular American city. The data exhibits 7-day periodicity due to differential
buying patterns during weekdays and weekends. We isolate a 60-day subsequence where
the data dynamics remain relatively stationary, and attempt to learn LDS parameters to be
able to simulate sequences of baseline values for use in detecting anomalies. In principle,
more accurate baseline models should lead to better anomaly detection.

We perform two experiments on different aggregations of the OTC data, with parameter
values n = 7, d = 7 and τ = 14. Figure 5.5(A) plots 22 different drug categories
aggregated over all zipcodes, and Figure 5.5(B) plots a single drug category (cough/cold)
in 29 different zipcodes separately. In both cases, constraint generation is able to use very
little training data to learn a stable model that captures the periodicity in the data, while
the least squares model is unstable and its predictions diverge over time. LB-1 learns a
model that is stable but overconstrained, and the simulated observations quickly drift from
the correct magnitudes.




Figure  5.5:  (A):  60  days  of  data  for  22  drug  categories  aggregated  over  all  zipcodes  in  the 
city.  (B):  60  days  of  data  for  a  single  drug  category  (cough/cold)  for  all  29  zipcodes  in 
the  city.  (C):  Sunspot  numbers  for  200  years  separately  for  each  of  the  12  months.  The 
training  data  (top),  simulated  output  from  constraint  generation,  output  from  the  unstable 
least  squares  model,  and  output  from  the  over-damped  LB-1  model  (bottom). 


5.4.4  Modeling  Sunspot  Numbers 

We compared least squares and constraint generation on learning LDS models for the
sunspot data discussed earlier in Section 3.3.2. We use parameter settings n = 7, d = 18,
τ = 50. Figure 5.5(C) represents a data-poor training scenario where we train a
least-squares model on 18 timesteps, yielding an unstable model whose simulated observations
increase in amplitude steadily over time. Again, constraint generation is able to use very
little training data to learn a stable model that seems to capture the periodicity in the data
as well as the magnitude, while the least squares model is unstable. The model learned by
LB-1 attenuates more noticeably, capturing the periodicity to a smaller extent. Quantitative
results on both these domains exhibit trends similar to those in Table 5.1.


5.5  Discussion 

We have introduced a novel method for learning stable linear dynamical systems. Our
constraint generation algorithm is more powerful than previous methods in the sense of
optimizing over a larger set of stable matrices with a suitable objective function. In
practice, the benefits of stability guarantees are readily noticeable, especially when the training
data  is  limited.  This  connection  between  stability  and  amount  of  training  data  is  impor¬ 
tant  in  practice,  since  time  series  data  is  often  expensive  to  collect  in  large  quantities, 
especially  for  phenomena  with  long  periods  in  domains  like  physics  or  astronomy.  The 
constraint  generation  approach  also  has  the  benefit  of  being  faster  than  previous  methods 
in  nearly  all  of  our  experiments.  Stability  could  also  be  of  advantage  in  planning  ap¬ 
plications.  Subspace  ID,  and  its  stable  variant  introduced  here,  provide  a  different  way  of 
looking  at  LVM  parameter  learning,  one  that  is  based  on  optimizing  the  predictive  capabil¬ 
ities  of  the  model.  Depending  on  the  number  of  observations  stacked  in  the  Hankel  matrix, 
we  could  optimize  for  short-term  or  long-term  prediction  if  the  data  domain  is  expected  or 
known  to  have  short  or  long-range  periodicity.  The  simplicity  of  the  matrix  decomposition 
approach,  and  its  natural  method  for  model  selection  by  examining  singular  values,  is  also 
very  attractive  in  contrast  to  EM-based  approaches.  These  factors  motivate  us  to  seek  an 
analogous  learning  method  for  HMM-type  models  as  well,  to  realize  these  benefits  in  the 
discrete  LVM  domain.  In  the  next  chapter  we  propose  a  variant  of  HMMs,  along  with  a 
matrix-decomposition-based  learning  algorithm  for  it  that  attains  this  goal. 
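
As an illustration of model selection by examining singular values, the following Python/NumPy sketch stacks a scalar observation sequence into a Hankel matrix and counts its non-negligible singular values. The helper `hankel_matrix`, the sinusoidal series, the choice of d = 5 stacked observations, and the relative threshold are assumptions made for this example rather than the settings used in our experiments.

```python
import numpy as np

def hankel_matrix(y, d):
    """Stack d consecutive observations per column: H[i, j] = y[i + j]."""
    y = np.asarray(y, dtype=float)
    cols = len(y) - d + 1
    return np.array([y[i:i + cols] for i in range(d)])

# Hypothetical scalar series generated by a 2-dimensional (sinusoidal) system.
t = np.arange(60)
y = np.sin(0.3 * t)

H = hankel_matrix(y, d=5)
singular_values = np.linalg.svd(H, compute_uv=False)

# Count singular values that are non-negligible relative to the largest one.
estimated_order = int(np.sum(singular_values > 1e-8 * singular_values[0]))
```

A noise-free sinusoid satisfies a second-order linear recurrence, so only two singular values are non-negligible here. With noisy data the spectrum decays gradually instead, and the cutoff becomes a modeling choice.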
Chapter 6


Reduced-Rank  Hidden  Markov  Models 


So far we have examined continuous-observation Hidden Markov Models (HMMs) [4]
and Linear Dynamical Systems (LDSs) [28], which are two examples of latent variable
models  of  dynamical  systems.  The  distributional  assumptions  of  HMMs  and  LDSs  result 
in  important  differences  in  the  evolution  of  their  belief  over  time.  The  discrete  state  of 
HMMs  is  good  for  modeling  systems  with  mutually  exclusive  states  that  can  have  com¬ 
pletely  different  observation  signatures.  The  predictive  distribution  over  observations  is 
allowed  to  be  non-log-concave  when  predicting  or  simulating  the  future,  leading  to  what 
we  call  competitive  inhibition  between  states  (see  Figure  6.4  below  for  an  example).  Com¬ 
petitive  inhibition  denotes  the  ability  of  a  model’s  predictive  distribution  to  place  proba¬ 
bility  mass  on  observations  while  disallowing  mixtures  of  those  observations.  Conversely, 
the  Gaussian  predictive  distribution  over  observations  in  LDSs  is  log-concave,  and  thus 
does  not  exhibit  competitive  inhibition.  However,  LDSs  naturally  model  smooth  state 
evolution,  which  HMMs  are  particularly  bad  at.  The  dichotomy  between  the  two  models 
hinders  our  ability  to  compactly  model  systems  that  exhibit  both  competitive  inhibition 
and  smooth  state  evolution.  In  this  chapter  we  present  the  Reduced-Rank  Hidden  Markov 
Model  (RR-HMM),  a  dynamical  system  model  which  can  perform  smooth  state  evolution 
as  well  as  competitive  inhibition.  Intuitively  the  RR-HMM  assumes  that  the  dynamical 
system  evolves  smoothly  along  a  low-dimensional  subspace  in  a  large  discrete  state  space. 
We discuss theoretical connections to previous work, propose a learning algorithm with
finite-sample performance guarantees, and demonstrate results on high-dimensional
real-world sequential data.


6.1  Introduction 

HMMs  can  approximate  smooth  state  evolution  by  tiling  the  state  space  with  a  very  large 
number  of  low-observation-variance  discrete  states  with  a  specific  transition  structure. 
However,  inference  and  learning  in  such  a  model  is  highly  inefficient  due  to  the  large 
number  of  parameters,  and  due  to  the  fact  that  existing  HMM  learning  algorithms,  such 
as Expectation Maximization (EM) [4], are prone to local minima. More sophisticated
EM-based algorithms for learning HMMs that avoid local minima, such as STACS
(Chapter 4), help to some extent. However, they are still heuristics that do not offer any theoretical
performance guarantees, and cannot learn very large-state-space models as efficiently or
accurately as we would like (on the other hand, STACS does learn a positive realization
of HMM parameters unlike the spectral method we describe here, but this does not hinder
us from performing inference and prediction using the RR-HMM parameters we obtain).
RR-HMMs  allow  us  to  reap  many  of  the  benefits  of  large-state-space  HMMs  without  in¬ 
curring  the  associated  inefficiency  during  inference  and  learning.  Indeed,  we  show  that 
all  inference  operations  in  the  RR-HMM  can  be  carried  out  in  the  low-dimensional  space 
where  the  dynamics  evolve,  decoupling  their  computational  cost  from  the  number  of  hid¬ 
den  states.  This  makes  rank-k  RR-HMMs  (with  any  number  of  states)  as  computationally 
efficient as k-state HMMs, but much more expressive. Figure 6.1(A) shows an example
of observations from a 4-state dynamical system which cannot be represented by a 3-state
HMM (Figure 6.1(B)), but is accurately modeled by a rank-3 RR-HMM (Figure 6.1(C)) that
can be learned using a single invocation of the algorithm described here. A 4-state HMM
can be learned for this data (Figure 6.1(D)). Traditional methods for learning this HMM,
such as EM, involve performing multiple restarts to escape local minima.

Though  the  RR-HMM  is  in  itself  novel,  its  low-dimensional  representation  is  a  spe¬ 
cial  case  of  several  existing  models  such  as  Predictive  State  Representations  [68],  Observ¬ 
able  Operator  Models  [69],  generalized  HMMs  [70]  and  multiplicity  automata  [71,  72], 
Figure 6.1: (A) Observations from a dynamical system with 4 discrete states and 3 discrete
observations, two of whose states emit the observation 'c' and can only be distinguished
by the underlying dynamics. (B) 3-state HMMs learned using EM with multiple restarts
cannot represent this model, as evinced by simulations from this model. (C) A rank-3
RR-HMM estimated using a single run of the learning algorithm described in this chapter
represents this 4-state model accurately (as seen from simulations), as does (D) a 4-state
HMM learned using EM, though the latter needs multiple restarts to discover the
overlapping states and avoid local minima.
and is also related to the representation of LDSs learned using Subspace Identification [27].
These and other related models and algorithms are discussed further in Section 6.4.

To learn RR-HMMs from data, we adapt and extend a recently proposed spectral learning
algorithm by Hsu, Kakade and Zhang [12] (henceforth referred to as HKZ) that learns
observable representations of HMMs using matrix decomposition and regression on
empirically estimated observation probability matrices of past and future observations. An
observable representation of an HMM allows us to model sequences with a series of
operators without knowing the underlying stochastic transition and observation matrices. The
HKZ algorithm is free of local optima and asymptotically unbiased, with finite-sample
bounds on $L_1$ error in joint probability estimates and on KL-divergence of conditional
probability estimates from the resulting model. However, the original algorithm and its
bounds assume (1) that the transition model is full-rank and (2) that single observations
are informative about the entire latent state, i.e. 1-step observability. We generalize the
HKZ bounds to the low-rank transition matrix case and derive tighter bounds that depend
on k instead of m, allowing us to learn rank-k RR-HMMs of arbitrarily large m in
$O(Nk^2)$ time, where N is the number of samples. We also describe and test a method
for circumventing the 1-step observability condition by combining observations to make
them more informative. Furthermore, in the Appendix we also provide consistency results
which show that the learning algorithm can in fact learn PSRs (more details on this are
available in [73]), though our error bounds don't yet generalize to these models.

Experiments  show  that  our  learning  algorithm  can  recover  the  underlying  RR-HMM  in 
a  variety  of  synthetic  domains.  We  also  demonstrate  that  RR-HMMs  are  able  to  compactly 
model  smooth  evolution  and  competitive  inhibition  in  a  clock  pendulum  video,  as  well  as 
in  real-world  mobile  robot  vision  data  captured  in  an  office  building.  Robot  vision  data 
(and,  in  fact,  most  real-world  multivariate  time  series  data)  exhibits  smoothly  evolving 
dynamics  requiring  multimodal  predictive  beliefs,  for  which  RR-HMMs  are  particularly 
suited.  We  compare  performance  of  RR-HMMs  to  LDSs  and  HMMs  on  simulation  and 
prediction  tasks.  Proofs  and  some  other  details  are  in  the  Appendix. 
Figure 6.2: (A) The graphical model representation of an RR-HMM. $l_t$ denotes the
k-dimensional state vector, $h_t$ the m-dimensional discrete state, and $x_t$ the discrete
observation. The distributions over $h_t$ and $l_{t+1}$ are deterministic functions of $l_t$. (B) An illustration
of different RR-HMM parameters and the spaces and random variables they act on. (C)
Projection of sets of predictive distributions of a rank 3 RR-HMM with 10 states, and a
3-state full-rank HMM with similar parameters.

6.1.1  Definitions 

Assume the HMM notation introduced in Chapter 2, assume T has rank k and let T = RS
where $R \in \mathbb{R}^{m \times k}$ and $S \in \mathbb{R}^{k \times m}$. This implies that the dynamics of the system can
be expressed in $\mathbb{R}^k$ rather than $\mathbb{R}^m$. By convention, we think of S as projecting the
m-dimensional state distribution vector to a k-dimensional state vector, and R as expanding
this low-dimensional state back to an m-dimensional state distribution vector while
propagating it forward in time. One possible choice for R and S is to use any k independent
columns of T as the columns of R, and let the columns of S contain the coefficients
required to reconstruct T from R, though other choices are possible (e.g. using SVD). Also
assume for now that m < n (we relax this assumption in Section 6.2.4). We denote the
k-dimensional projection of the hidden state vector $h_t$ as $l_t$, which is simply a vector of
real numbers rather than a stochastic vector. We assume the initial state distribution lies in
the low dimensional space as well, i.e. $\pi = R\pi_l$ for some vector $\pi_l \in \mathbb{R}^k$. Figure 6.2(A)
illustrates the graphical model corresponding to an RR-HMM. Figure 6.2(B) illustrates
some of the different RR-HMM parameters and the spaces they act on.

To see how the probability of a sequence can be computed using these parameters, first
recall the definition of $A_x$ according to equation (2.1) and the fact that T = RS:

    $A_x = RS \, \mathrm{diag}(O_{x,\cdot})$

and define $W_x$:

    $W_x = S \, \mathrm{diag}(O_{x,\cdot}) R$

Also let $W = \sum_x W_x = SR$. Equation (2.2) shows how to write the joint probability of
a sequence $x_1, \ldots, x_t$ using $\{A_x\}$. With the above definitions, the joint probability can also
be written using $\{W_x\}$:

    $\Pr[x_1, \ldots, x_t] = \mathbf{1}_m^\top RS \, \mathrm{diag}(O_{x_t,\cdot}) RS \, \mathrm{diag}(O_{x_{t-1},\cdot}) R \cdots S \, \mathrm{diag}(O_{x_1,\cdot}) \pi$   (from equation (2.2))
    $= \mathbf{1}_m^\top R \, (S \, \mathrm{diag}(O_{x_t,\cdot}) R)(S \, \mathrm{diag}(O_{x_{t-1},\cdot}) R) \cdots (S \, \mathrm{diag}(O_{x_1,\cdot}) R) \, \pi_l$
    $= \mathbf{1}_m^\top R \, W_{x_t} \cdots W_{x_1} \pi_l$   (by definition of $W_x$, $\pi_l$)   (6.1)
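
The equivalence of the two parametrizations can be checked numerically. The sketch below is a Python/NumPy illustration under assumed inputs: the sizes m, k, n, the random nonnegative factors R and S (chosen only so that T = RS is a valid transition matrix), and the helper names are hypothetical. It evaluates the joint probability once with the m x m operators $A_x$ and once with the k x k operators $W_x$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 6, 2, 3   # hidden states, rank, observations (illustrative sizes)

def random_column_stochastic(rows, cols):
    """Random nonnegative matrix whose columns each sum to one."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

# Assumed construction: nonnegative factors give a valid rank-k T = R S.
# In general R and S are allowed to contain negative entries.
R = random_column_stochastic(m, k)
S = random_column_stochastic(k, m)
T = R @ S                                    # m x m, columns sum to one
O = random_column_stochastic(n, m)           # O[x, h] = Pr[obs x | state h]
pi_l = random_column_stochastic(k, 1)[:, 0]  # low-dimensional initial state
pi = R @ pi_l                                # initial distribution in range(R)

# W_x = S diag(O[x, :]) R, one k x k operator per observation symbol.
W = [S @ (O[x][:, None] * R) for x in range(n)]

def joint_prob_full(xs):
    """1^T A_{x_t} ... A_{x_1} pi with A_x = T diag(O[x, :])."""
    v = pi
    for x in xs:
        v = T @ (O[x] * v)
    return float(v.sum())

def joint_prob_reduced(xs):
    """1^T R W_{x_t} ... W_{x_1} pi_l, using only k-dimensional updates."""
    v = pi_l
    for x in xs:
        v = W[x] @ v
    return float((R @ v).sum())

seq = [0, 2, 1, 1, 0]
p_full = joint_prob_full(seq)
p_reduced = joint_prob_reduced(seq)
```

Each step of `joint_prob_reduced` is a k x k matrix-vector product, which is where the $O(k^2)$ per-observation cost comes from; the number of hidden states m enters only through the one-time construction of the $W_x$ and the final multiplication by R.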


The latter parametrization casts a rank-k RR-HMM as a k-dimensional PSR or
transformed PSR [74]. Inference can be carried out in $O(Nk^2)$ time in this representation.
However, since every HMM is trivially a PSR, this leads to the question of how expressive
rank-k RR-HMMs are in comparison to k-state full-rank HMMs. The following example
is instructive.


6.1.2  Expressivity  of  RR-HMMs 

We describe a rank-k RR-HMM whose set of possible predictive distributions is easy to
visualize and describe. Consider the following rank 3 RR-HMM with 10 states and 4
observations. The observation probabilities in each state are of the form

    $O_{\cdot,i} = [\, p_i q_i \quad p_i(1-q_i) \quad (1-p_i)q_i \quad (1-p_i)(1-q_i) \,]^\top$

for some $0 \le p_i, q_i \le 1$, which can be interpreted as 4 discrete observations, factored as
two binary components which are independent given the state. T and $p_i$, $q_i$ are chosen to
place the vertices of the set of possible predictive distributions on evenly spaced points
along a circle in (p, q)-space:

    $T_{ij} = \frac{1}{2m} \left[ 2 + \sin(2\pi i/m)\sin(2\pi j/m) + \cos(2\pi i/m)\cos(2\pi j/m) \right]$
    $p_i = [\sin(2\pi i/m) + 1]/2$
    $q_i = [\cos(2\pi i/m) + 1]/2$

We plot the marginal probability of each component of the observation, ranging across all
achievable values of the latent state vector for the m = 10 case (Figure 6.2(C)), yielding
a 10-sided polygon as the projection of the set of possible predictive distributions. These
distributions are the columns of $T^\top O$. We also plot the corresponding marginals for the
m = 3 full-rank HMM case to yield a triangular set. More generally, from a k-state HMM,
we can get at most a k-sided polygon for the set of possible predictive distributions.
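
The construction in this example is easy to verify numerically. The following Python/NumPy sketch builds T, $p_i$ and $q_i$ for m = 10 from the formulas of this section and checks the rank of T and the location of the polygon vertices. The variable names and the rank tolerance are assumptions made for the illustration.

```python
import numpy as np

m = 10
theta = 2 * np.pi * np.arange(1, m + 1) / m   # angles 2*pi*i/m for i = 1..m

# T[i, j] = (1/2m) [2 + sin(th_i) sin(th_j) + cos(th_i) cos(th_j)]
T = (2 + np.outer(np.sin(theta), np.sin(theta))
       + np.outer(np.cos(theta), np.cos(theta))) / (2 * m)
p = (np.sin(theta) + 1) / 2
q = (np.cos(theta) + 1) / 2

rank_T = np.linalg.matrix_rank(T, tol=1e-10)

# Column j of T is the next-state distribution from state j, so these are the
# one-step-ahead marginals of the two binary observation components.
p_next = p @ T
q_next = q @ T
radius = np.hypot(p_next - 0.5, q_next - 0.5)
```

Under these formulas T is a sum of three rank-one terms, its columns sum to one, and the ten predictive marginals fall on a circle of radius 1/8 around (1/2, 1/2), which gives the ten vertices of the polygon.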

The above example illustrates that rank-k RR-HMMs with m states can model sets
of predictive distributions which full-rank HMMs with fewer than m states cannot express.
However, as we shall see, inference in rank-k RR-HMMs of arbitrary m is as efficient as
inference in k-state full-rank HMMs. This implies that the additional degrees of freedom
in the RR-HMM's low-dimensional parameters and state vectors buy it considerable
expressive power. Since RR-HMMs are also related to PSRs as pointed out in the previous
section, and since our learning algorithm will be shown to be consistent for estimating
PSRs (though we have finite-sample guarantees only for the RR-HMM case), it is also
instructive to examine the expressivity of PSRs in general. We refer the reader to Jaeger
(2000) [69] and James et al. (2004) [75] for more on this.


6.2  Learning  Reduced-Rank  HMMs 

In a full-rank HMM, the maximum likelihood solution for the parameters {T, O} can be
found through iterative techniques such as expectation maximization (EM) [15]. EM,
however, is prone to local optima and does not address the model selection problem. STACS
is better but still not guaranteed to return anything close to optimal as data increases, and
is slow beyond a certain state space size. Moreover, in learning RR-HMMs we face
the additional challenge of learning the factors of its low-rank transition matrix. We could
use EM to estimate T followed by (or combined with) matrix factorization algorithms
such as Singular Value Decomposition (SVD) [32] or Non-negative Matrix Factorization
(NMF) [76]. This approach has several drawbacks. For example, if the noisy estimate of
a low-rank transition matrix is not low-rank itself, SVD could cause negative numbers to
appear in the reconstructed transition matrix. Also, algorithms for NMF are only locally
optimal, and NMF is overly restrictive in that it constrains its factor matrices to be
non-negative, which is unnecessary for our application since low-rank transition matrices may
have negative numbers in their factors R and S.
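
The first drawback is easy to demonstrate. In the Python/NumPy sketch below, the 3 x 3 matrix is an illustrative full-rank stochastic matrix chosen for this example (it does not come from any experiment); truncating its SVD to rank 2 produces a negative entry, so the result is no longer a valid transition matrix.

```python
import numpy as np

# Illustrative full-rank, column-stochastic matrix (not from the source).
T = np.array([[11, 4, 0],
              [4, 7, 4],
              [0, 4, 11]]) / 15.0

U, s, Vt = np.linalg.svd(T)
k = 2
# Best rank-k approximation in the least-squares sense (truncated SVD).
T_rank_k = (U[:, :k] * s[:k]) @ Vt[:k]

min_entry = float(T_rank_k.min())   # the zero entries of T become -1/30
```

For this matrix the singular values are 1, 11/15 and 1/5, and dropping the smallest one moves the entries that were exactly zero to -1/30.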

An alternative approach, which we adopt, is to learn an asymptotically unbiased
observable representation of an RR-HMM directly using SVD of a probability matrix relating
past and future observations. This idea has roots in subspace identification [27, 29] and
multiplicity automata [71, 72, 70] as well as the PSR/OOM literature [69, 77] and was
recently formulated in a paper by Hsu, Kakade and Zhang [12] for full-rank HMMs. We use
their algorithm, extending its theoretical guarantees for the low-rank HMM case where the
rank of the transition matrix T is k < m. Computationally, the only difference in our base
algorithm (before Section 6.2.4) is that we learn a rank k representation instead of rank
m. This allows us to learn much more compact representations of possibly large-state-space
real-world HMMs, and greatly increases the applicability of the original algorithm. Even
when the underlying HMM is not low-rank, we can examine the singular values to tune the
complexity of the underlying RR-HMM, thus providing a natural method for model
selection. We present the main definitions, the algorithm and its performance bounds below.
Detailed versions of the supporting proofs and lemmas can be found in the Appendix.

6.2.1 The Algorithm

The learning algorithm depends on the following vector and matrix quantities that comprise
properties of single observations, pairs of observations and triples:

\[ [P_1]_i = \Pr[x_1 = i] \]
\[ [P_{2,1}]_{i,j} = \Pr[x_2 = i, x_1 = j] \]
\[ [P_{3,x,1}]_{i,j} = \Pr[x_3 = i, x_2 = x, x_1 = j] \quad \text{for } x = 1, \ldots, n \]

$P_1 \in \mathbb{R}^n$ is a vector, and $P_{2,1} \in \mathbb{R}^{n \times n}$ and $P_{3,x,1} \in \mathbb{R}^{n \times n}$ are matrices. These quantities are
closely related to matrices computed in algorithms for learning OOMs [69], PSRs [77] and
LDSs using subspace identification (Subspace ID) [27]. They can be expressed in terms
of HMM parameters (for proofs see the Appendix: Lemmas 8 and 9 in Section A.1.1):

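\[ P_1^\top = \mathbf{1}_m^\top T \,\mathrm{diag}(\pi)\, O^\top \]
\[ P_{2,1} = O T \,\mathrm{diag}(\pi)\, O^\top \]
\[ P_{3,x,1} = O A_x T \,\mathrm{diag}(\pi)\, O^\top \]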
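The factorizations above can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy RR-HMM (the matrices `R`, `S`, `O` and the vector `pi` are made up for illustration) and assuming the usual observable operator $A_x = T\,\mathrm{diag}(O_{x,\cdot})$; it builds the three quantities from the parameters and confirms that they have rank at most k.

```python
import numpy as np

# Hypothetical toy RR-HMM: m = 3 hidden states, n = 3 observations, rank k = 2.
R = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])      # m x k, columns sum to 1
S = np.array([[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]])        # k x m, columns sum to 1
T = R @ S                                               # column-stochastic, rank 2
O = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])                         # O[x, h] = Pr[x | h]
pi = R @ np.array([0.4, 0.6])                           # initial state distribution

def moments(T, O, pi):
    """Exact P1, P21 and the list of P3x1 matrices from HMM parameters."""
    D = np.diag(pi)
    P1 = O @ pi
    P21 = O @ T @ D @ O.T
    # Assumption: A_x = T diag(O[x, :]) is the observable operator for symbol x.
    P3x1 = [O @ (T @ np.diag(O[x])) @ T @ D @ O.T for x in range(O.shape[0])]
    return P1, P21, P3x1

P1, P21, P3x1 = moments(T, O, pi)
# An explicit tolerance keeps floating-point noise from inflating the rank.
rank_P21 = np.linalg.matrix_rank(P21, tol=1e-10)
rank_P3 = [np.linalg.matrix_rank(P, tol=1e-10) for P in P3x1]
```

In this toy model `rank_P21` is 2 even though the matrices are 3 x 3, which is the property the learning algorithm exploits.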

Note that $P_{2,1}$ and $P_{3,x,1}$ both contain a factor of T and hence are both of rank k for a
rank-k RR-HMM. This property will be important for recovering an estimate of the RR-HMM
parameters from these matrices. The primary intuition is that, when projected onto
an appropriate linear subspace, $P_{3,x,1}$ is linearly related to $P_{2,1}$ through a product of RR-HMM
parameters. This allows us to devise an algorithm that

1. estimates $P_{2,1}$ and $P_{3,x,1}$ from data,

2. projects them to an appropriate linear subspace computed using SVD,

3. uses linear regression to estimate the RR-HMM parameters (up to a similarity transform)
from these projections.

Specifically, the algorithm attempts to learn an observable representation of the RR-HMM
using a matrix $U \in \mathbb{R}^{n \times k}$ such that $U^\top O R$ is invertible. An observable representation
is defined as follows.

Definition 1 The observable representation is defined to be the parameters $b_1$, $b_\infty$, $\{B_x\}_{x=1}^n$
such that:

\[ b_1 = U^\top P_1 \tag{6.2a} \]
\[ b_\infty = (P_{2,1}^\top U)^{+} P_1 \tag{6.2b} \]
\[ B_x = (U^\top P_{3,x,1})(U^\top P_{2,1})^{+} \quad \text{for } x = 1, \ldots, n \tag{6.2c} \]

For the RR-HMM, note that the dimensionality of the parameters is determined by k, not
m: $b_1 \in \mathbb{R}^k$, $b_\infty \in \mathbb{R}^k$ and $\forall x\; B_x \in \mathbb{R}^{k \times k}$. Though these definitions seem arbitrary
at first sight, the observable representation of the RR-HMM is closely related to the true
parameters of the RR-HMM in the following manner (see Lemma 9 in the Appendix for
the proof):


1. $b_1 = (U^\top O R)\,\pi_l = (U^\top O)\,\pi$

2. $b_\infty^\top = \mathbf{1}_m^\top R\,(U^\top O R)^{-1}$

3. For all $x = 1, \ldots, n$: $B_x = (U^\top O R)\,W_x\,(U^\top O R)^{-1}$

Hence $B_x$ is a similarity transform of the RR-HMM parameter matrix $W_x = S\,\mathrm{diag}(O_{x,\cdot})\,R$
(which, as we saw earlier, allows us to perform RR-HMM inference), and $b_1$ and $b_\infty$ are
the corresponding linear transformations of the RR-HMM initial state distribution and the
RR-HMM normalization vector. Note that $(U^\top O R)$ must be invertible for these relationships
to hold. Together, the parameters $b_1$, $b_\infty$ and $B_x$ for all x comprise the observable
representation of the RR-HMM. Our learning algorithm will estimate these parameters
from data. The algorithm for estimating rank-k RR-HMMs is equivalent to the spectral
HMM learning algorithm of HKZ [12] for learning k-state HMMs. Our relaxation of their
conditions (e.g. HKZ assume a full-rank transition matrix, without which their bounds are
vacuous), and our performance guarantees for learning rank-k RR-HMMs, show that the
algorithm learns a much larger class of k-dimensional models than the class of k-state
HMMs.
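The similarity-transform relation for $B_x$ can be traced in a few lines. The following is a sketch of the argument rather than the proof given in the Appendix; it assumes $T = RS$, the usual observable operator $A_x = T\,\mathrm{diag}(O_{x,\cdot})$, that $U^\top O R$ is invertible, and that $S\,\mathrm{diag}(\pi)\,O^\top$ has full row rank k (so that it times its pseudo-inverse is the identity).

```latex
\begin{align*}
U^\top P_{2,1}
  &= (U^\top O R)\,\bigl(S\,\mathrm{diag}(\pi)\,O^\top\bigr), \\
U^\top P_{3,x,1}
  &= U^\top O\,\bigl(R S\,\mathrm{diag}(O_{x,\cdot})\bigr)\,R S\,\mathrm{diag}(\pi)\,O^\top
   = (U^\top O R)\,W_x\,\bigl(S\,\mathrm{diag}(\pi)\,O^\top\bigr), \\
B_x
  &= (U^\top P_{3,x,1})\,(U^\top P_{2,1})^{+} \\
  &= (U^\top O R)\,W_x\,\bigl(S\,\mathrm{diag}(\pi)\,O^\top\bigr)
     \bigl(S\,\mathrm{diag}(\pi)\,O^\top\bigr)^{+}\,(U^\top O R)^{-1}
   = (U^\top O R)\,W_x\,(U^\top O R)^{-1}.
\end{align*}
```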

Learn-RR-HMM(k, N) The learning algorithm takes as input the desired rank k of
the underlying RR-HMM rather than the number of states m. Alternatively, given a singular
value threshold the algorithm can choose the rank of the HMM by examining the
singular values of $\hat{P}_{2,1}$ in Step 2. It assumes that we are given N independently sampled
observation triples $(x_1, x_2, x_3)$ from the HMM. In practice, we can use a single long sequence
of observations as long as we discount the bound on the number of samples based
on the mixing rate of the HMM (i.e. (1 − the second eigenvalue of T)), in which case
$\pi$ must correspond to the stationary distribution of the HMM to allow estimation of $P_1$.
The algorithm results in an estimated observable representation of the RR-HMM, with
parameters $\hat{b}_1$, $\hat{b}_\infty$, and $\hat{B}_x$ for $x = 1, \ldots, n$. The steps are briefly summarized here for
reference:

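1. Compute empirical estimates $\hat{P}_1$, $\hat{P}_{2,1}$, $\hat{P}_{3,x,1}$ of $P_1$, $P_{2,1}$, $P_{3,x,1}$ (for $x = 1, \ldots, n$).

2. Use SVD on $\hat{P}_{2,1}$ to compute $\hat{U}$, the matrix of left singular vectors corresponding to
the k largest singular values.

3. Compute model parameter estimates:

(a) $\hat{b}_1 = \hat{U}^\top \hat{P}_1$,

(b) $\hat{b}_\infty = (\hat{P}_{2,1}^\top \hat{U})^{+} \hat{P}_1$,

(c) $\hat{B}_x = \hat{U}^\top \hat{P}_{3,x,1} (\hat{U}^\top \hat{P}_{2,1})^{+}$ (for $x = 1, \ldots, n$)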
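The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in our experiments: the function names `estimate_moments`, `observable_params` and `choose_rank` are hypothetical, observations are assumed to be integer symbols 0, ..., n-1, and the rank-selection rule (an absolute threshold on singular values) is one possible reading of the singular value threshold mentioned above.

```python
import numpy as np

def estimate_moments(triples, n):
    """Step 1: empirical P1, P21, P3x1 from (x1, x2, x3) triples of symbols 0..n-1."""
    P1, P21, P3x1 = np.zeros(n), np.zeros((n, n)), np.zeros((n, n, n))
    for x1, x2, x3 in triples:
        P1[x1] += 1.0
        P21[x2, x1] += 1.0
        P3x1[x2, x3, x1] += 1.0      # first axis selects the middle symbol x
    N = float(len(triples))
    return P1 / N, P21 / N, P3x1 / N

def choose_rank(P21, threshold):
    """Assumed rule: count singular values above an absolute threshold."""
    s = np.linalg.svd(P21, compute_uv=False)
    return max(1, int((s > threshold).sum()))

def observable_params(P1, P21, P3x1, k):
    """Steps 2-3: project with the top-k left singular vectors, then solve."""
    U = np.linalg.svd(P21)[0][:, :k]
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    proj_pinv = np.linalg.pinv(U.T @ P21)
    B = np.stack([U.T @ P3x1[x] @ proj_pinv for x in range(len(P1))])
    return b1, binf, B

# Tiny hand-made dataset over n = 2 symbols, just to exercise the functions.
triples = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
P1_hat, P21_hat, P3x1_hat = estimate_moments(triples, n=2)
b1_hat, binf_hat, B_hat = observable_params(P1_hat, P21_hat, P3x1_hat, k=1)
```

Note that nothing in `observable_params` depends on the number of hidden states m; only n and k appear.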

We now examine how we can perform inference in the RR-HMM using the observable
representation. For this, we will need to define the internal state $b_t$. Just as the parameter
$b_1$ is a linear transform of the initial RR-HMM belief state, $b_t$ is a linear transform of the
belief state of the RR-HMM at time t (Lemma 10 in Section A.1.1 of the Appendix):

\[ b_t = (U^\top O R)\, l_t(x_{1:t-1}) = (U^\top O)\, h_t(x_{1:t-1}) \]

This internal state $b_t$ can be updated to condition on observations and evolve over time,
just as we can update $l_t$ for RR-HMMs and $h_t$ for regular HMMs.

6.2.2 Inference in the Observable Representation

Given a set of observable parameters, we can predict the probability of a sequence, update
the internal state $b_t$ to perform filtering and predict conditional probabilities as follows (see
Lemma 10 in the Appendix for proof):

• Predict sequence probability: $\Pr[x_1, \ldots, x_t] = b_\infty^\top B_{x_t} \cdots B_{x_1} b_1$

• Internal state update: $b_{t+1} = \dfrac{B_{x_t} b_t}{b_\infty^\top B_{x_t} b_t}$

• Conditional probability of $x_t$ given $x_{1:t-1}$: $\Pr[x_t \mid x_{1:t-1}] = \dfrac{b_\infty^\top B_{x_t} b_t}{\sum_x b_\infty^\top B_x b_t}$
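The three operations above translate directly into code. The sketch below is illustrative: the function names are hypothetical, and the toy RR-HMM (with made-up `R`, `S`, `O` and an initial distribution chosen in the column space of `R`) is used only to compare the observable representation, built from exact moments, against the standard forward recursion.

```python
import numpy as np

def sequence_probability(b1, binf, B, xs):
    """Pr[x_1..x_t] = binf^T B[x_t] ... B[x_1] b1."""
    v = b1
    for x in xs:
        v = B[x] @ v
    return float(binf @ v)

def update_state(binf, B, b, x):
    """b_{t+1} = B[x] b_t / (binf^T B[x] b_t)."""
    v = B[x] @ b
    return v / (binf @ v)

def conditional_probability(binf, B, b, x):
    """Pr[x_t = x | history] as a ratio of unnormalized scores."""
    scores = np.array([binf @ (Bx @ b) for Bx in B])
    return float(scores[x] / scores.sum())

# Hypothetical toy RR-HMM (m = 3, n = 3, k = 2) with exact moments.
R = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
S = np.array([[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]])
T = R @ S
O = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
pi = R @ np.array([0.4, 0.6])
D = np.diag(pi)
P1 = O @ pi
P21 = O @ T @ D @ O.T
U = np.linalg.svd(P21)[0][:, :2]
proj_pinv = np.linalg.pinv(U.T @ P21)
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ (O @ T @ np.diag(O[x]) @ T @ D @ O.T) @ proj_pinv for x in range(3)]

# Reference value from the unnormalized forward recursion with A_x = T diag(O[x]).
xs = [0, 2, 1]
h = pi.copy()
for x in xs:
    h = T @ (O[x] * h)
p_true = float(h.sum())
p_obs = sequence_probability(b1, binf, B, xs)
b2 = update_state(binf, B, b1, xs[0])
```

With exact moments the two probabilities agree up to floating-point error; with estimated moments they agree only approximately.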

Estimated parameters can, in theory, lead to negative probability estimates. These are
most harmful when they cause the normalizers $\hat{b}_\infty^\top \hat{B}_{x_t} \hat{b}_t$ or $\sum_x \hat{b}_\infty^\top \hat{B}_x \hat{b}_t$ to be negative.
However, in our experiments, the latter was never negative and the former was very rarely
negative; and, using real-valued observations with KDE (as in Section 6.2.6) makes negative
normalizers even less likely, since in this case the normalizer is a weighted sum of
several estimated probabilities. In practice we recommend thresholding the normalizers
with a small positive number, and not trusting probability estimates for a few steps if the
normalizers fall below the threshold.
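One way to implement this recommendation is sketched below. The function name `filter_with_guard`, the default `floor` value and the `cooldown` length are assumptions made for illustration; the text above only prescribes a small positive threshold and "a few steps" of distrust.

```python
import numpy as np

def filter_with_guard(b1, binf, B, xs, floor=1e-6, cooldown=3):
    """Filter over xs, clamping small normalizers and flagging untrusted steps.

    Returns the final state and a list of booleans, one per step; a step is
    untrusted if the normalizer fell below `floor` at that step or at one of
    the previous `cooldown - 1` steps.
    """
    b = np.asarray(b1, dtype=float)
    trusted, wait = [], 0
    for x in xs:
        v = B[x] @ b
        z = float(binf @ v)
        if z < floor:
            z, wait = floor, cooldown   # clamp and start the distrust window
        trusted.append(wait == 0)
        wait = max(0, wait - 1)
        b = v / z
    return b, trusted

# One-dimensional toy operators: symbol 1 produces a tiny normalizer.
B_toy = [np.array([[0.5]]), np.array([[1e-9]])]
b_final, trusted = filter_with_guard(np.array([1.0]), np.array([1.0]),
                                     B_toy, [0, 1, 0, 0, 0])
```

In the toy run the second step triggers the guard, so that step and the two following it are flagged, and the filter recovers afterwards.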

Note that the inference operations occur entirely in $\mathbb{R}^k$. We mentioned earlier that parameterizing
RR-HMM parameters as $W_x$ for all observations x casts it as a PSR of k dimensions.
In fact the learning and inference algorithms for RR-HMMs proposed here have
no dependence on the number of states m whatsoever, though other learning algorithms
for RR-HMMs can depend on m (e.g. if they learn R and S directly). The RR-HMM
formulation is intuitively appealing due to the idea of a large discrete state space with
low-rank transitions, but this approach is also a provably consistent learning algorithm for
PSRs in general, with finite-sample performance guarantees for the case where the PSR
is an RR-HMM. Since PSRs are provably more expressive and compact than finite-state
HMMs [69, 75], this indicates that we can learn a more powerful class of models than
HMMs using this algorithm.

6.2.3 Theoretical Guarantees

The following finite sample bound on the estimated model generalizes analogous results
from HKZ to the case of low-rank T. Theorem 2 bounds the $L_1$ error in joint probability
estimates from the learned model. This bound shows the consistency of the algorithm
in learning a correct observable representation of the underlying RR-HMM, without ever
needing to recover the high-dimensional parameters R, S, O of the latent representation.
Note that our error bounds are independent of m, the number of hidden states; instead,
they depend on k, the rank of the transition matrix, which can be much smaller than m.
Since HKZ explicitly assumes a full-rank HMM transition matrix, and their bounds become
vacuous otherwise, generalizing their framework involves relaxing this condition,
generalizing the theoretical guarantees of HKZ and deriving proofs for these guarantees.

Define $\sigma_k(M)$ to denote the k-th largest singular value of matrix M. The sample complexity
bounds depend polynomially on $1/\sigma_k(P_{2,1})$ and $1/\sigma_k(OR)$. The larger $\sigma_k(P_{2,1})$
is, the more well-separated are the dynamics from noise. The larger $\sigma_k(OR)$ is, the more
informative the observation is regarding state. For both these quantities, the larger the
magnitude, the fewer samples we need to learn a good model. The bounds also depend
on a term $n_0(\epsilon)$, which is the minimum number of observations that account for $(1 - \epsilon)$
of the total probability mass, i.e. the number of “important” observations. Recall that N
is  the  number  of  independently  sampled  observation  triples  which  comprise  the  training 
data,  though  as  mentioned  earlier  we  can  also  learn  from  a  single  long  training  sequence. 
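To make the term $n_0(\epsilon)$ concrete, the following sketch computes it for a given marginal distribution over observations by greedily taking the most probable symbols first. The function name `n0` and the example distribution are hypothetical.

```python
def n0(probs, eps):
    """Smallest number of observations whose total mass is at least 1 - eps."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= 1.0 - eps:
            return count
    return len(probs)

# Example marginal over four observation symbols (made up for illustration).
marginal = [0.5, 0.3, 0.15, 0.05]
important_10 = n0(marginal, 0.10)   # three symbols cover 95% >= 90% of the mass
important_25 = n0(marginal, 0.25)   # two symbols cover 80% >= 75% of the mass
```

A distribution concentrated on a few symbols therefore has a small $n_0(\epsilon)$ even when n itself is large.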

The theorem holds under mild conditions. Some of these are the same as (or relaxations
of) conditions in HKZ, namely that the prior $\pi$ is nonzero everywhere, and a number of
matrices of interest ($R$, $S$, $O$, $(U^\top O R)$) are of rank at least k for invertibility reasons. The
other conditions are unique to the low-rank setting, namely that $S\,\mathrm{diag}(\pi)\,O^\top$ has rank at
least k, R has at least one column whose $L_2$ norm is at most $\sqrt{k/m}$, and the $L_1$ norm of R
is at most 1. The first of these conditions implies that the column space of S and the row
space of O have some degree of overlap. The other two are satisfied, in the case of HMMs,
by thinking of R as containing k linearly independent probability distributions along its
columns (including a near-uniform column) and of S as containing the coefficients needed
to obtain T from those columns. Alternatively, the conditions can be satisfied for an arbitrary
R by scaling down entries of R and scaling up entries of S accordingly. However,
this increases $1/\sigma_k(OR)$, and hence we pay a price by increasing the number of samples
needed to attain a particular error bound. See the Appendix (Section A.1.1) for formal
statements of these conditions.

Theorem 2 [Generalization of HKZ Theorem 6] There exists a constant $C > 0$ such that
the following holds. Pick any $0 < \epsilon, \eta < 1$ and $t \ge 1$. Assume the HMM obeys Conditions
3, 4, 5, 6 and 7. Let $\epsilon_0 = \sigma_k(OR)\,\sigma_k(P_{2,1})\,\epsilon / (4t\sqrt{k})$. Assume

With probability $\ge 1 - \eta$, the model returned by Learn-RR-HMM(k, N) satisfies

\[ \sum_{x_1, \ldots, x_t} \left| \Pr[x_1, \ldots, x_t] - \widehat{\Pr}[x_1, \ldots, x_t] \right| \le \epsilon \]

where the summation is over all possible observation sequences of length t.

For  the  proof,  see  the  Appendix  (Section  A.  1.4). 

6.2.4  Learning  with  Observation  Sequences  as  Features 

The probability matrix $P_{2,1}$ relates one past timestep to one future timestep, under the assumption
that the vector of observation probabilities at a single step is sufficient to disambiguate
state (n > m and rank(O) = m). In system identification theory, this corresponds
to assuming 1-step observability [27]. This assumption is unduly restrictive for many
real-world dynamical systems of interest. More complex sufficient statistics of past and
future may need to be modeled, such as the block Hankel matrix formulations for subspace
methods [27, 29] to identify linear systems that are not 1-step observable.

For RR-HMMs, this corresponds to the case where n < m and/or rank(O) < m. Similar
to the Hankel matrix formulation, we can stack multiple observation vectors such that
each augmented observation comprises data from several, possibly consecutive, timesteps.
The observations in the augmented observation vectors are assumed to be non-overlapping,
i.e. all observations in the new observation vector at time t + 1 have larger time indices
than observations in the new observation vector at time t. This corresponds to assuming
past sequences and future sequences spanning multiple timesteps as events that characterize
the dynamical system, causing $P_1$, $P_{2,1}$ and $P_{3,x,1}$ to be larger. Note that the x in $P_{3,x,1}$
still denotes a single observation, whereas the other indices in $P_1$, $P_{2,1}$ and $P_{3,x,1}$ are now
associated with events. For example, if we stack n consecutive observations, $P_{3,x,1}[i, j]$
equals the probability of seeing an n-length sequence, followed by the single observation
x, followed by another n-length sequence, where i and j index the two sequences. Empirically estimating this matrix consists
of  scanning  for  the  appropriate  subsequences  i  and  j  separated  by  observation  symbol  x, 
and  normalizing  to  obtain  the  occurrence  probability. 
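The scanning procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `stacked_triple_counts` is hypothetical, events are taken from every position of a single sequence (a sliding scan), and events are keyed by tuples rather than by matrix indices.

```python
from collections import Counter

def stacked_triple_counts(seq, w):
    """Occurrence probabilities of (future, x, past) events.

    `past` and `future` are non-overlapping length-w windows and `x` is the
    single observation between them.
    """
    counts = Counter()
    span = 2 * w + 1
    for s in range(len(seq) - span + 1):
        past = tuple(seq[s:s + w])
        x = seq[s + w]
        future = tuple(seq[s + w + 1:s + span])
        counts[(future, x, past)] += 1
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

# A short alternating sequence with windows of length 2 (three events in total).
probs = stacked_triple_counts([0, 1, 0, 1, 0, 1, 0], w=2)
```

Mapping each distinct window to a row or column index then yields the entries of the stacked estimate of $P_{3,x,1}$ for each middle symbol x.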

$P_{2,1}$ and $P_{3,x,1}$ become larger matrices if we use a larger set of events in the past and
future. However, stacking observations does not complicate the dynamics: it can be shown
that the rank of $P_{2,1}$ and $P_{3,x,1}$ cannot exceed k (see Section A.2 in the Appendix for a proof
sketch). Since our learning algorithm relies on an SVD of $P_{2,1}$, this means that augmenting
the observations does not increase the rank of the HMM we are trying to recover. Also,
since $P_{3,x,1}$ is still an observation probability matrix with respect to a single unstacked
observation x in the middle, the number of observable operators we need remains constant.
Our complexity bounds successfully generalize to this case, since they only rely on $P_1$,
$P_{2,1}$ and $P_{3,x,1}$ being matrices of probabilities summing to 1 (for the former two) or to
$\Pr[x_2 = x]$ (for the latter), as they are here.

The extension given above for learning HMMs with ambiguous observations differs
from the approach suggested by HKZ, which simply substitutes observations with overlapping
tuples of observations (e.g. $P_{2,1}(j, i) = \Pr[x_3 = j_2, x_2 = j_1, x_2 = i_2, x_1 = i_1]$).
There are two potential problems with the HKZ approach. First, the number of observable
operators increases exponentially with the length of each tuple: there is one observable
operator per tuple, instead of one per observation. Second, $P_{2,1}$ cannot be decomposed
into a product of matrices that includes T, and consequently no longer has rank equal to
the rank of the HMM being modeled. Thus, the learning algorithm could require much
more data to recover a correct model if we use the HKZ approach.


6.2.5  Learning  with  Indicative  and  Characteristic  Features 

To generalize even further beyond sequences of observations, we can think about deriving
higher-level features of past and future observations, which we call indicative features
and characteristic features respectively, and generalizing $P_1$, $P_{2,1}$ and $P_{3,x,1}$ to contain
expected values of these features (or possibly products of features). These are analogous
to but more general than the indicative events and characteristic events of OOMs [69] and
suffix-histories and tests of PSRs [77]. These more specific formulations can be obtained
by assuming our indicative and characteristic features to be indicator variables of some
indicative and characteristic events, causing the expected values of singletons, pairs and
triples of these features (or their products) to equal probabilities of occurrence of these
events. This leads to the current scenario where the matrices $P_1$, $P_{2,1}$ and $P_{3,x,1}$ contain
probabilities, and in fact our theoretical guarantees hold in this case. In the more general
case where we choose arbitrary indicative and characteristic features of past and future
observations which do not lead to a probabilistic interpretation of $P_1$, $P_{2,1}$ and $P_{3,x,1}$, our
error bounds do not hold as written, although there is good reason to believe that they
can be generalized. However, the algorithm is still valid in the sense that we can prove
consistency of the resulting estimates. See Section A.4 of Appendix A.

6.2.6  Kernel  Density  Estimation  for  Continuous  Observations 

The default RR-HMM formulation assumes discrete observations. However, since the
model formulation converts the discrete observations into m-dimensional probability vectors,
and the filtering, smoothing and learning algorithms we discuss all do the same, it is
straightforward to model multivariate continuous data. Counting on our ability to perform
efficient inference and learning even in very large state spaces, we pick n points in observation
space as kernels, and perform Kernel Density Estimation (KDE) to convert each
continuous datapoint into a probability vector. In our experiments we use Gaussian kernels
with uniform bandwidths; however, any smooth kernel will suffice. The kernel centers can
be chosen by sampling, clustering or other standard methods. In our experiments below,
we use an initial subsequence of training data points as kernel centers directly. The kernel
bandwidths can be estimated from data using maximum likelihood estimation, or can be
fixed beforehand. In the limit of infinite training data, the bandwidth can shrink to zero
and the KDE gives an accurate estimate of the observation density. Our theoretical results
carry over to the KDE case with modifications described in Section A.1.5. Essentially,
the bounds still hold for the case where we are observing stochastic vectors, though we
do not yet have additional bounds connecting this existing bound to the error in estimating
probabilities of raw continuous observations. When filtering in this formulation, we
compute a vector of kernel center probabilities for each observation. We normalize this
vector and use it to generate a convex combination of observable operators, which is used
for incorporating the observation rather than any single observable operator.

This affects the learning algorithm and inference procedure as follows. Assume for
ease of notation that the training data consists of N sets of three consecutive continuous
observation vectors each, i.e. $\{(x_{1,1}, x_{1,2}, x_{1,3}), (x_{2,1}, x_{2,2}, x_{2,3}), \ldots, (x_{N,1}, x_{N,2}, x_{N,3})\}$,
though in practice we could be learning from a single long training sequence (or several).
Also assume for now that each observation vector contains a single raw observation,
though this technique can easily be combined with the more sophisticated sequence-based
learning and feature-based learning methods described above. We will assume n kernel
centers chosen from the training data, where the i-th kernel is centered at $c_i$, and each kernel
has a covariance matrix of $\Sigma$. $\lambda$ is a bandwidth parameter that goes to zero over time. Let
$\mathcal{N}(\mu, C)$ denote a multivariate Gaussian distribution with mean $\mu$ and covariance matrix
C, and $\Pr[y \mid \mathcal{N}(\mu, C)]$ be the probability of y under this Gaussian.

The learning algorithm begins by computing n-dimensional feature vectors $\langle\phi_j\rangle_{j=1}^N$,
$\langle\psi_j\rangle_{j=1}^N$ and $\langle\xi_j\rangle_{j=1}^N$ where

\[ [\phi_j]_i = \Pr[x_{j,1} \mid \mathcal{N}(c_i, \Sigma)] \]
\[ [\psi_j]_i = \Pr[x_{j,2} \mid \mathcal{N}(c_i, \Sigma)] \]
\[ [\xi_j]_i = \Pr[x_{j,3} \mid \mathcal{N}(c_i, \Sigma)] \]

In addition, the n-dimensional vectors $\langle\zeta_j\rangle_{j=1}^N$ represent the kernel probabilities of the
second observation which shrink to zero in the limit of infinite data at an appropriate rate
via the bandwidth parameter $\lambda$:

\[ [\zeta_j]_i = \Pr[x_{j,2} \mid \mathcal{N}(c_i, \lambda\Sigma)] \]

These vectors are then normalized to sum to 1 individually. Then, the vector $\hat{P}_1$ and
matrices $\hat{P}_{2,1}$ and $\hat{P}_{3,x,1}$ (for a given x), can be estimated as follows:
\[ \hat{P}_1 = \frac{1}{N} \sum_{j=1}^{N} \phi_j \]
\[ \hat{P}_{2,1} = \frac{1}{N} \sum_{j=1}^{N} \psi_j \phi_j^\top \]
\[ \text{For } x = c_1, \ldots, c_n: \quad \hat{P}_{3,x,1} = \frac{1}{N} \sum_{j=1}^{N} [\zeta_j]_x\, \xi_j \phi_j^\top \]
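These KDE-based estimates can be sketched in NumPy as follows. The function names `gaussian_kernel_probs` and `kde_moments`, the toy centers, covariance, bandwidth and data are all hypothetical. Because every kernel shares the same covariance, the Gaussian normalizing constant cancels when each vector is normalized to sum to 1, so the sketch works with unnormalized log-densities.

```python
import numpy as np

def gaussian_kernel_probs(x, centers, cov):
    """Vector over kernel centers, proportional to N(x; c_i, cov), summing to 1."""
    diff = centers - x                                   # shape (n, d)
    sol = np.linalg.solve(cov, diff.T).T                 # cov^{-1} (c_i - x)
    logp = -0.5 * np.sum(diff * sol, axis=1)             # shared constant dropped
    w = np.exp(logp - logp.max())                        # stabilized exponent
    return w / w.sum()

def kde_moments(triples, centers, cov, lam):
    """Estimates of P1, P21, P3x1 from triples of continuous observations."""
    n = len(centers)
    P1, P21, P3x1 = np.zeros(n), np.zeros((n, n)), np.zeros((n, n, n))
    for x1, x2, x3 in triples:
        phi = gaussian_kernel_probs(x1, centers, cov)
        psi = gaussian_kernel_probs(x2, centers, cov)
        xi = gaussian_kernel_probs(x3, centers, cov)
        zeta = gaussian_kernel_probs(x2, centers, lam * cov)   # narrower kernel
        P1 += phi
        P21 += np.outer(psi, phi)
        P3x1 += zeta[:, None, None] * np.outer(xi, phi)[None, :, :]
    N = len(triples)
    return P1 / N, P21 / N, P3x1 / N

# Toy one-dimensional data with three kernel centers (made up for illustration).
centers = np.array([[0.0], [1.0], [2.0]])
cov = np.array([[0.25]])
triples = [(np.array([0.1]), np.array([0.9]), np.array([2.1])),
           (np.array([1.2]), np.array([1.9]), np.array([0.2]))]
P1_hat, P21_hat, P3x1_hat = kde_moments(triples, centers, cov, lam=0.1)
```

The resulting arrays can be passed to the same SVD and regression steps as in the discrete case.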

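Once these matrices have been estimated, the rest of the algorithm is unchanged. We
compute n ‘base’ observable operators $\hat{B}_{c_1}, \ldots, \hat{B}_{c_n}$ from the $\hat{P}_{3,x,1}$ matrices estimated
above, and vectors $\hat{b}_1$ and $\hat{b}_\infty$, as before. Given these parameter estimates, filtering for a
$\tau$-length observation sequence $(x_1, \ldots, x_\tau)$ now proceeds in the following way:

For $t = 1, \ldots, \tau$:

Compute $\sigma_t$ such that $[\sigma_t]_i = \Pr[x_t \mid \mathcal{N}(c_i, \lambda\Sigma)]$, and normalize.

\[ \hat{B}_{\sigma_t} = \sum_{j=1}^{n} [\sigma_t]_j\, \hat{B}_{c_j} \]

\[ \hat{b}_{t+1} = \frac{\hat{B}_{\sigma_t} \hat{b}_t}{\hat{b}_\infty^\top \hat{B}_{\sigma_t} \hat{b}_t} \]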
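A single step of this filtering loop can be sketched as follows. The function names `kernel_weights` and `kde_filter_step` and the toy operators are hypothetical; the sketch assumes the base operators are stacked in an array of shape (n, k, k).

```python
import numpy as np

def kernel_weights(x, centers, cov):
    """Normalized kernel probabilities of x under N(c_i, cov)."""
    diff = centers - x
    logp = -0.5 * np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    w = np.exp(logp - logp.max())
    return w / w.sum()

def kde_filter_step(b, binf, B, weights):
    """Update b with the convex combination sum_j weights[j] * B[j]."""
    B_sigma = np.tensordot(weights, B, axes=1)     # (n,) with (n, k, k) -> (k, k)
    v = B_sigma @ b
    return v / (binf @ v)

# Toy base operators for n = 2 kernel centers and k = 2 (made up for illustration).
B = np.stack([0.5 * np.eye(2), np.array([[0.2, 0.1], [0.0, 0.3]])])
binf = np.ones(2)
b = np.array([0.6, 0.4])
centers = np.array([[0.0], [1.0]])
sigma = kernel_weights(np.array([0.4]), centers, 0.1 * np.array([[0.25]]))
b_next = kde_filter_step(b, binf, B, sigma)
```

If the weight vector puts all of its mass on one center, the step reduces to the discrete update with that center's operator.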


6.3  Experimental  Results 

We designed several experiments to evaluate the properties of RR-HMMs and the learning
algorithm both on synthetic and on real-world data. The first set of experiments (Section
6.3.1) tests the ability of the spectral learning algorithm to recover the correct RR-HMM.
The second experiment (Section 6.3.2) evaluates the representational capacity of the RR-HMM
by learning a model of a video that requires both competitive inhibition and smooth
state evolution. The third set of experiments (Section 6.3.3) tests the model’s ability to
learn, filter, predict, and simulate video captured from a robot moving in an indoor office
environment.

Figure 6.3: Learning discrete RR-HMMs. The three figures depict the actual eigenvalues
of three different RR-HMM transition matrices, and the eigenvalues (95% error bars) of
the sum of RR-HMM observable operators estimated with 10,000 and 100,000 training
observations. (A) A 3-state, 3-observation, rank 2 RR-HMM. (B) A full-rank, 3-state,
2-observation HMM. (C) A 4-state, 2-observation, rank 3 RR-HMM.

6.3.1  Learning  Synthetic  RR-HMMs 

First we evaluate the unbiasedness of the spectral learning algorithm for RR-HMMs on
3 synthetic examples. In each case, we build an RR-HMM, sample observations from
the model, and estimate the model with the spectral learning algorithm described in Section
6.2. We compare the eigenvalues of $B = \sum_x B_x$ in the learned model to the eigenvalues
of the transition matrix T of the true model. B is a similarity transform of $S \cdot R$ which
therefore has the same non-zero eigenvalues as $T = RS$, so we expect the estimated eigenvalues
to converge to the true eigenvalues with enough data. This is a necessary condition
for unbiasedness but not a sufficient one. See Section A.3 in the Appendix for parameters of
HMMs used in the examples below.
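The eigenvalue comparison can be illustrated in the infinite-data limit, where the moments are exact. The sketch below uses a hypothetical toy RR-HMM (not one of the models from Section A.3) and assumes the observable operator $A_x = T\,\mathrm{diag}(O_{x,\cdot})$; it checks that the eigenvalues of $\sum_x B_x$ match the non-zero eigenvalues of T.

```python
import numpy as np

# Hypothetical toy RR-HMM: m = 3, n = 3, k = 2 (parameters made up for illustration).
R = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
S = np.array([[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]])
T = R @ S
O = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
pi = R @ np.array([0.4, 0.6])
k, n = 2, 3

# Exact moments and the projection onto the top-k left singular vectors of P21.
D = np.diag(pi)
P21 = O @ T @ D @ O.T
U = np.linalg.svd(P21)[0][:, :k]
proj_pinv = np.linalg.pinv(U.T @ P21)

# Sum of observable operators B_x = (U^T P3x1)(U^T P21)^+ over all symbols x.
B_sum = sum(U.T @ (O @ T @ np.diag(O[x]) @ T @ D @ O.T) @ proj_pinv
            for x in range(n))

learned = np.sort(np.linalg.eigvals(B_sum).real)
true_nonzero = np.sort(np.abs(np.linalg.eigvals(T)))[-k:]
```

With finite samples the learned eigenvalues are only approximately equal to the true ones, which is what the error bars in Figure 6.3 quantify.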

Example 1: An RR-HMM We examine an HMM with m = 3 hidden states, n = 3
observations, a full-rank observation matrix and a k = 2 rank transition matrix. Figure
6.3(A) plots the true and estimated eigenvalues for increasing size of dataset, along
with error bars, suggesting that we recover the true dynamic model.

Example 2: A 2-step-Observable HMM We examine an HMM with m = 3 hidden
states, n = 2 observations, and a full-rank transition matrix (see Appendix for parameters).
This HMM violates the $m \le n$ condition. The parameters of this HMM cannot be
estimated with the original learning algorithm, since a single observation does not provide
enough information to disambiguate state. By stacking 2 consecutive observations
(see Section 6.2.4), however, the spectral learning algorithm can be applied successfully
(Figure 6.3(B)).

Example 3: A 2-step-Observable RR-HMM We examine an HMM with m = 4 hidden
states, n = 2 observations, and a k = 3 rank transition matrix (see Appendix for parameters).
In this example, the HMM is low rank and multiple observations are required
to disambiguate state. Again, stacking two consecutive observations in conjunction with
the spectral learning algorithm is enough to recover good RR-HMM parameter estimates
(Figure 6.3(C)).

6.3.2  Competitive  Inhibition  and  Smooth  State  Evolution  in  Video 

We model a clock pendulum video consisting of 55 frames (with a period of ~22 frames)
as a 10-state HMM, a 10-dimensional LDS, and a rank 10 RR-HMM with 4 stacked observations.
Note that we could easily learn models with more than 10 latent states/dimensions;
we limited the dimensionality in order to demonstrate the relative expressive power of the
different models. For the HMM, we convert the continuous data to discrete observations
by 1-NN on 25 kernel centers sampled sequentially from the training data. We trained the
resulting discrete HMM using EM. We learned the LDS directly from the video using subspace
ID with stability constraints [78] using a Hankel matrix of 10 stacked observations.
We trained the RR-HMM by stacking 4 observations, choosing an approximate rank of
10 dimensions, and learning 25 observable operators corresponding to 25 Gaussian kernel
centers. We simulate a series of 500 observations from the model and compare the manifolds
underlying the simulated observations and frames from the simulated videos (Figure
6.4). The small number of states in the HMM is not sufficient to capture the smooth
evolution of the clock: the simulated video is characterized by realistic looking frames,
but exhibits jerky irregular motion. For the LDS, although the 10-dimensional subspace
captures smooth evolution of the simulated video, the system quickly degenerates and individual
frames of video are modeled poorly (resulting in superpositions of pendulums
in generated frames). For the RR-HMM, the simulated video benefits from both smooth
state evolution and competitive inhibition. The manifold in the 10-dimensional subspace
is smooth and structured and the video is realistic. The results demonstrate that the RR-HMM
has the benefits of smooth state evolution and compact state space of an LDS and the
benefit of competitive inhibition of an HMM.

Figure 6.4: The clock video texture simulated by an HMM, a stable LDS, and an RR-HMM.
(A) The clock modeled by a 10-state HMM. The manifold consists of the top 3 principal
components of predicted observations during simulation. The generated frames are coherent
but motion in the video is jerky. (B) The clock modeled by a 10-dimensional LDS. The
manifold indicates the trajectory of the model in state space during simulation. Motion in
the video is smooth but frames degenerate to superpositions. (C) The clock modeled by a
rank 10 RR-HMM. The manifold consists of the trajectory of the model in the low-dimensional
subspace of the state space during simulation. Both the motion and the frames are
correct.

Figure 6.5: (A) The mobile robotic platform used in experiments. (B) Sample images
from the robot’s camera. The lower figure depicts the hallway environment with a central
obstacle (black) and the path that the robot took through the environment while collecting
data (the red counter-clockwise ellipse). (C) Squared loss prediction error for different
models after filtering over initial part of data. The RR-HMM performs more accurate
predictions consistently for 30 timesteps.


6.3.3  Filtering,  Prediction,  and  Simulation  with  Robot  Vision  Data 

We  compare  HMMs,  LDSs,  and  RR-HMMs  on  the  problem  of  modeling  video  data  from  a 
mobile  robot  in  an  indoor  environment.  A  video  of  2000  frames  was  collected  at  6  Hz  from 
a  Point  Grey  Bumblebee2  stereo  camera  mounted  on  a  Botrics  Obot  dlOO  mobile  robot 
platform  (Figure  6.5(A))  circling  a  stationary  obstacle  (Figure  6.5(B))  and  1500  frames 
were  used  as  training  data  for  each  model.  Each  frame  from  the  training  data  was  reduced 
to  100  dimensions  via  SVD  on  single  observations.  Using  this  training  data,  we  trained 
an RR-HMM (k = 50, n = 1500) using spectral learning with 20 stacked continuous
observations and KDE (Section 6.2.6) with 1500 centers, a 50-dimensional LDS using
Subspace ID with Hankel matrices of 20 timesteps, and a 50-state HMM with 1500 discrete
observations using EM run until convergence. For each model, we performed filtering for
different extents t1 = 100, 101, ..., 250, then predicted an image which was a further t2
points in the future, for t2 = 1, 2, ..., 100. The squared error of this prediction in pixel
space was recorded, and averaged over all the different filtering extents t1 to obtain means
which are plotted in Figure 6.5(C). As baselines, we plot the error obtained by using the
mean of filtered data as a predictor (titled ‘Mean’), and the error obtained by using the last
filtered observation as a predictor (titled ‘Last’).
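
The evaluation protocol described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the observations are scalars rather than image frames, the function names (`horizon_errors`, `mean_baseline`, `last_baseline`) are hypothetical, and the toy series stands in for the robot video. It shows how the error for each horizon t2 is averaged over the filtering extents t1, and how the two baselines fit the same interface as a learned model would.

```python
from statistics import mean

def horizon_errors(series, predict, filter_extents, horizons):
    """Average squared prediction error per horizon t2, over filtering extents t1.

    `predict(history, t2)` is a hypothetical callback that returns a scalar
    forecast of the value t2 steps after the end of `history`.
    """
    errors = {}
    for t2 in horizons:
        squared = []
        for t1 in filter_extents:
            history = series[:t1]           # observations used for filtering
            target = series[t1 + t2 - 1]    # value t2 steps past the history
            squared.append((predict(history, t2) - target) ** 2)
        errors[t2] = mean(squared)
    return errors

def mean_baseline(history, t2):
    """'Mean' baseline: predict the mean of the filtered data."""
    return mean(history)

def last_baseline(history, t2):
    """'Last' baseline: predict the last filtered observation."""
    return history[-1]

# Toy periodic scalar series (an assumption, not the thesis data).
series = [float(t % 4) for t in range(40)]
err_mean = horizon_errors(series, mean_baseline, range(10, 21), range(1, 5))
err_last = horizon_errors(series, last_baseline, range(10, 21), range(1, 5))
```

A learned model would plug in through the same `predict` callback, by filtering over `history` and then rolling its state forward t2 steps.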

Both baselines perform worse than any of the more complex algorithms (though as
expected, the ‘Last’ predictor is a good one-step predictor), indicating that this is a
nontrivial prediction problem. The LDS does well initially (due to smoothness), and the HMM
does well in the longer run (due to competitive inhibition), while the RR-HMM performs
as well or better at both time scales since it models both the smooth state evolution and
competitive inhibition in its predictive distribution. In particular, the RR-HMM yields
significantly lower prediction error consistently for the first 30 timesteps (i.e. five seconds) of
the prediction horizon.


6.4  Related  Work 

6.4.1  Predictive  State  Representations 

Predictive  State  Representations  (PSRs)  [68, 77]  and  Observable  Operator  Models  (OOMs)  [69] 
model  sequence  probabilities  as  a  product  of  observable  operator  matrices.  This  idea,  as 
well  as  the  idea  of  learning  such  models  using  linear  algebra  techniques,  originates  in  the 
literature on multiplicity automata and weighted automata [71, 72, 70]. Despite recent
improvements [79, 80], practical learning algorithms for PSRs and OOMs have been lacking.
RR-HMMs and their spectral learning algorithm are also closely related to methods in
subspace identification [27, 29] in control systems for learning LDS parameters, which
use SVD to determine the relationship between hidden states and observations.
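
To make the phrase "a product of observable operator matrices" concrete, here is a minimal sketch for the special case of an ordinary HMM. It assumes the standard operator form A_x = T diag(O[x, :]) from the HMM literature, and the 2-state parameters are hypothetical values chosen for illustration. The probability of a sequence is then 1^T A_{x_t} ... A_{x_1} pi, and these probabilities sum to one over all sequences of a fixed length.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical 2-state, 2-observation HMM (columns of T and O sum to 1).
T = [[F(9, 10), F(2, 10)],
     [F(1, 10), F(8, 10)]]    # T[a][b] = Pr[next state = a | state = b]
O = [[F(3, 4), F(1, 4)],
     [F(1, 4), F(3, 4)]]      # O[x][a] = Pr[observation = x | state = a]
pi = [F(1, 2), F(1, 2)]       # initial state distribution

def operator(x):
    """Observable operator A_x = T diag(O[x, :])."""
    m = len(T)
    return [[T[a][b] * O[x][b] for b in range(m)] for a in range(m)]

def sequence_probability(seq):
    """Pr[x_1, ..., x_t] = 1^T A_{x_t} ... A_{x_1} pi."""
    v = pi
    for x in seq:
        A = operator(x)
        v = [sum(A[a][b] * v[b] for b in range(len(v))) for a in range(len(A))]
    return sum(v)

# Probabilities of all length-3 sequences should sum to exactly one.
total = sum(sequence_probability(s) for s in product(range(2), repeat=3))
```

Exact rational arithmetic is used here only so that the normalization check holds without floating-point tolerance.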

As pointed out earlier, the spectral learning algorithm presented here learns PSRs. We
briefly discuss other algorithms for learning PSRs from data. Several learning algorithms
for PSRs have been proposed [81, 75, 82]. It is easier for PSR learning algorithms to return
consistent parameter estimates because the parameters are based on observable quantities.
[74] develops an SVD-based method for finding a low-dimensional variant of PSRs, called
Transformed PSRs (TPSRs). Instead of tracking the probabilities of a small number of
tests, TPSRs track a small number of linear combinations of a larger number of tests. This
allows more compact representations, as well as dimensionality selection based on
examining the singular values of the decomposed matrix, as in subspace identification methods.
Note that nonlinearity can be encoded into the design of core tests. [83] introduced the
concept of e-tests in PSRs that are indicator functions of aggregate sets of future outcomes,
e.g. all sequences of observations in the immediate future that end with a particular
observation after k timesteps. In general, tests in discrete PSRs can be indicator functions of
arbitrary statistics of future events, thus encoding nonlinearities that might be essential for
modeling some dynamical systems. Recently, Exponential Family PSRs (EFPSRs) [79]
were introduced as an attempt to generalize the PLG model to allow general exponential
family distributions over the next N observations. In the EFPSR, state is represented by
modeling the parameters of a time-varying exponential family distribution over the next N
timesteps. This allows graphical structure to be encoded in the distribution, by choosing
the parameters accordingly. The justification for choosing an exponential family comes
from maximum entropy modeling. Though inference and parameter learning are difficult
in graphical models of non-trivial structure, approximate inference methods can be
utilized to make these problems tractable. Like PLGs, the dynamical component of EFPSRs
is modeled by extending and conditioning the distribution over time. However, the method
presented in [79] has some drawbacks, e.g. the extend-and-condition method is inconsistent
with respect to marginals over individual timesteps between the extended and un-extended
distributions.

6.4.2 Hybrid Models, Mixture Models and other recent approaches


RR-HMMs and their algorithms are also related to other hybrid models. Note that previous
models of the same name (e.g. [84]) address a completely different problem, i.e. reducing
the rank of the Gaussian observation parameters. Since shortly after the advent of LDSs,
there have been attempts to combine the discrete states of HMMs with the smooth
dynamics of LDSs. We perform a brief review of the literature on hybrid models; see [85] for a
more thorough review. [86] formulates a switching LDS variant where both the state and
observation variable noise models are mixture of Gaussians with the mixture switching
variable evolving according to Markovian dynamics, and derives the (intractable) optimal
filtering equations where the number of Gaussians needed to represent the belief increases
exponentially over time. They also propose an approximate filtering algorithm for this
model based on a single Gaussian. [87] proposes learning algorithms for an LDS with
switching observation matrices. [88] reviews models where both the observations and
state variable switch according to a discrete variable with Markov transitions. Hidden
Filter HMMs (HFHMMs) [89] combine discrete and real-valued state variables and outputs
that depend on both. The real-valued state is deterministically dependent on previous
observations in a known manner, and only the discrete variable is hidden. This allows exact
inference in this model to be tractable. [90] formulates the Mixture Kalman Filter (MKF)
model along with a filtering algorithm, similar to [86] except that the filtering algorithm is
based on sequential Monte-Carlo sampling.

The commonly used HMMs with mixture-model observations (e.g., Gaussian mixture)
are a special case of RR-HMMs. A k-state HMM where each state corresponds to a
Gaussian mixture of m observation models of n dimensions each is subsumed by a
k-rank RR-HMM with m distinct continuous observations of n dimensions each, since the
former is constrained to be non-negative and ≤ 1 in various places (the k-dimensional
transition matrix, the k-dimensional belief vector, the matrix which transforms this belief
to observation probabilities) where the latter is not.

Switching State-Space Models (SSSMs) [85] posit the existence of several real-valued
hidden state variables that evolve linearly, with a single Markovian discrete-valued
switching variable selecting the state which explains the real-valued observation at every timestep.
Since exact inference and learning are intractable in this model, the authors derive a
structured variational approximation that decouples the state space and switching
variable chains, effectively resulting in Kalman smoothing on the state space variables and
HMM forward-backward on the switching variable. In their experiments, the authors find
SSSMs to perform better than regular LDSs on a physiological data modeling task with
multiple distinct underlying dynamical models. HMMs performed comparably well in
terms of log-likelihood, indicating their ability to model nonlinear dynamics though the
resulting model was less interpretable than the best SSSM. More recently, models for
nonlinear time series modeling such as Gaussian Process Dynamical Models have been
proposed [91]. However, the parameter learning algorithm is only locally optimal, and
exact inference and simulation are very expensive, requiring MCMC over a long sequence of
frames all at once. This necessitates the use of heuristics for both inference and learning.
Another recent nonlinear dynamic model is [92], which differs greatly from other methods
in that it treats each component of the dynamic model learning problem separately using
supervised learning algorithms, and proves consistency on the aggregate result under
certain strong assumptions.


6.5  Discussion 

Much as the Subspace ID algorithm (Chapter 3) provides a predictive, local optima-free
method for learning continuous latent-variable models, the spectral learning algorithm
here blurs the line between latent variable models and PSRs. PSRs were developed with
a focus on the problem of an agent planning actions in a partially observable
environment. More generally, there are many scenarios in sequential data modeling where the
underlying dynamical system has inputs. The inference task for a learned model is then to
track the belief state while conditioning on observations and incorporating the inputs. The
input-output HMM (IO-HMM) [17] is a conditional probabilistic model which has these
properties. A natural generalization of this work is to the task of learning RR-HMMs with
inputs, or controlled PSRs. We discuss this further in Chapter 7. We recently carried out
this generalization to controlled PSRs; details can be found in [73].

The  question  of  proving  containment  or  equivalence  of  RR-HMMs  with  respect  to 
PSRs is of theoretical interest. The observable representation of an RR-HMM is a Transformed
PSR (TPSR) [74], so every RR-HMM is a PSR; it remains to be seen whether every
PSR  corresponds  to  some  RR-HMM  (possibly  with  an  infinite  number  of  discrete  hidden 
states)  as  well.  The  idea  that  “difficult”  PSRs  should  somehow  correspond  to  RR-HMMs 
with  very  large  or  infinite  state  space  is  intuitively  appealing  but  not  straightforward  to 
prove.  Another  interesting  direction  would  be  to  bound  the  performance  of  the  learning 
algorithm  when  the  underlying  model  is  only  approximately  a  reduced-rank  HMM,  much 
as  the  HKZ  algorithm  includes  bounds  when  the  underlying  model  is  approximately  an 
HMM [12]. This would be useful since in practice it is more realistic to expect any underlying
system to not comply with the exact model assumptions.

The positive realization problem, i.e. obtaining stochastic transition and observation
matrices from the RR-HMM observable representation, is also significant, though the
observable representation allows us to carry out all possible HMM operations. HKZ
describes a method based on [93] which, however, is highly erratic in practice. In the
RR-HMM case, we have the additional challenge of firstly computing the minimal m for which
a positive realization exists, and since the algorithm learns PSRs there is no guarantee that
a particular set of learned parameters conforms exactly to any RR-HMM. On the
applications side, it would be interesting to compare RR-HMMs with other dynamical models on
classification tasks, as well as on learning models of difficult video modeling and
graphics problems for simulation purposes. More elaborate choices of features may be useful
in such applications, as would be the usage of high-dimensional or infinite-dimensional
features via Reproducing Kernel Hilbert Spaces (RKHS).

Chapter 7


Future  Work  and  Discussion 


The area of dynamical system modeling is of great importance to machine learning and
robotics, and the ideas in this thesis lead to many promising avenues of research in these
fields. In this chapter we first describe several possible extensions of models and
algorithms we have presented. We next summarize the main contributions of this thesis and
discuss its significance in the context of the larger body of machine learning research.


7.1  Future  Work 

7.1.1 Scaling STACS for learning very large state-space HMMs

While efficient inference algorithms have been proposed for HMMs with up to a thousand
states based on specific properties of the underlying parameter space [48, 94], it is more
difficult to learn exact HMMs of such large finite state space size. Research on the infinite
HMM and its variants [22, 21] provides sampling-based methods for learning
nonparametric HMM models with an effectively infinite number of states, but these algorithms are
inefficient and don’t scale well to high-dimensional sequences. The STACS and V-STACS
algorithms presented in Chapter 4 efficiently learn accurate models of HMMs with up to a
hundred states or so. Beyond that, several factors hinder efficient and accurate learning:

1. The computational cost of performing split-tests on each HMM state at every model
expansion step.

2.  The  cost  of  performing  EM  or  Viterbi  Training  on  the  entire  HMM  in  every  model 
learning  step. 

3.  The  increased  chance  of  false  positives  in  the  candidate  testing  step. 

The upshot is that, if we do manage to learn accurate models of this size, the
aforementioned inference algorithms for large-state-space HMMs would allow us to carry out
efficient inference in these learned HMMs under a variety of possible additional assumptions,
some of which are relatively reasonable for large state spaces (e.g. assuming the transition
matrix is additively sparse + low-rank [48]).

One way to address (1) is lazy testing of candidates. If a particular candidate state
changes minimally between model expansion steps in terms of its transition and
observation distributions, it is unlikely that its rank as a split candidate will change much. We
can develop this intuition further to form a criterion for selectively testing only those
candidates that either were likely candidates in the previous model expansion step or have
changed substantially since then. Another, more exact way of addressing (1) is to take
advantage of the highly decoupled nature of split-tests for each state and run them in
parallel, which is an increasingly feasible option in light of the growing availability of parallel
architectures and software for parallelizing conventional machine learning algorithms.
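
One possible form of the lazy-testing criterion is sketched below. Everything here is an assumption made for illustration: each state's transition and observation distributions are flattened into a single list, change is measured by L1 distance, and the threshold `change_tol` and the names `l1_change` and `candidates_to_test` are hypothetical. The sketch only shows the selection rule, not the split-test itself.

```python
def l1_change(old, new):
    """L1 distance between two parameter vectors of equal length."""
    return sum(abs(p - q) for p, q in zip(old, new))

def candidates_to_test(prev_params, curr_params, prev_top, change_tol=0.1):
    """Return the states whose split-test should be re-run.

    A state is selected if it is new, if its parameters moved by more than
    `change_tol` since the previous expansion step, or if it was among the
    likely split candidates (`prev_top`) in the previous step.
    """
    selected = []
    for state, curr in curr_params.items():
        prev = prev_params.get(state)
        changed = prev is None or l1_change(prev, curr) > change_tol
        if changed or state in prev_top:
            selected.append(state)
    return selected

# Hypothetical per-state parameters before and after one expansion step.
prev = {0: [0.5, 0.5], 1: [0.9, 0.1], 2: [0.2, 0.8]}
curr = {0: [0.5, 0.5], 1: [0.6, 0.4], 2: [0.2, 0.8], 3: [0.3, 0.7]}
to_test = candidates_to_test(prev, curr, prev_top={2})
```

In this toy case state 0 is skipped because it neither changed nor ranked highly before, which is where the savings would come from.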

The  limitation  of  performing  EM  on  large  state  spaces  (2)  can  similarly  be  addressed 
by  exploiting  recent  developments  in  parallelizing  belief  propagation  (e.g.  [95]).  Another 
possibility  is  to  only  update  regions  of  state  space  that  have  changed  significantly  w.r.t. 
previous  iterations. 

During split testing, the null hypothesis states that no split justifies the added
complexity (either through BIC score, test-set log-likelihood or other metric). The increased
risk of false positives in split testing (3) arises from the increased number of alternative
hypotheses available, since the likelihood that one of them will randomly beat the null
hypothesis increases. We can use techniques from multiple hypothesis testing [96] to modify
the split scores and compensate for this phenomenon. Though this avoids continuing to
split needlessly, it doesn’t necessarily make sure we choose the best candidate at each
iteration. A more accurate scoring criterion based on variational Bayes [58] or using test-set
log-likelihood for scoring (if enough data and computational resources are present) would
help address this issue.
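
The growth in false positives can be illustrated numerically. The sketch below assumes, purely for illustration, that each split-test can be treated as an independent test at significance level alpha; the thesis scores splits with BIC or log-likelihood, so this is an analogy and not the actual scoring rule. The Bonferroni correction shown is one standard technique from multiple hypothesis testing, and is not necessarily the one intended by [96].

```python
def familywise_error(alpha, n_candidates):
    """Chance that at least one of n independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** n_candidates

def bonferroni_level(alpha, n_candidates):
    """Per-test level that keeps the family-wise error at or below alpha."""
    return alpha / n_candidates

# With 100 candidate splits, an uncorrected 5% test almost always fires.
uncorrected = familywise_error(0.05, 100)
corrected = familywise_error(bonferroni_level(0.05, 100), 100)
```

The corrected level controls needless splitting, but as the text notes it does not by itself identify the best candidate.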

7.1.2  Constraint  generation  for  learning  stable  PSRs 

Stability is a desirable characteristic of a wide variety of sequential data models. For
example, a necessary condition for a PSR to be stable is that the sum of the observable operators
in a PSR or OOM must have at most unit spectral radius.¹ Chapter 6 describes a spectral
learning algorithm for RR-HMMs that is also shown to be able to learn PSRs/OOMs
(Section A.5 in the Appendix). The constraint generation algorithm described in Chapter 5
can be adapted to solve a quadratic programming problem that constrains the appropriate
spectral radius in order to achieve stability. Since PSRs with actions (i.e. controlled PSRs)
were designed with the goal of enabling efficient and accurate planning, stability becomes
especially desirable for PSR parameters.
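
The necessary condition above can be checked directly on a set of learned operators. The following is a rough sketch, not the constraint generation algorithm itself: it estimates the spectral radius with Gelfand's formula (a swapped-in technique chosen to avoid external libraries), and the helper names, the tolerance, and the 2-state HMM used to build example operators are all hypothetical. Because the norm-based estimate approaches the true radius from above, a small tolerance is allowed.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inf_norm(A):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def spectral_radius(A, squarings=12):
    """Estimate rho(A) via Gelfand's formula rho = lim ||A^k||^(1/k).

    Uses k = 2**squarings by repeated squaring, rescaling at every step and
    tracking the log of the removed scale so that nothing overflows.
    """
    M, log_scale = A, 0.0            # invariant: A^k = exp(log_scale) * M
    for _ in range(squarings):
        s = inf_norm(M)
        if s == 0.0:
            return 0.0
        M = [[v / s for v in row] for row in M]
        M = matmul(M, M)
        log_scale = 2.0 * (log_scale + math.log(s))
    final = inf_norm(M)
    if final == 0.0:
        return 0.0
    return math.exp((log_scale + math.log(final)) / 2 ** squarings)

def is_stable(operators, tol=1e-3):
    """Necessary condition: the summed operators have spectral radius <= 1."""
    n = len(operators[0])
    total = [[sum(op[i][j] for op in operators) for j in range(n)]
             for i in range(n)]
    return spectral_radius(total) <= 1.0 + tol

# Hypothetical operators A_x = T diag(O[x, :]) of a 2-state HMM; they sum to T.
T = [[0.9, 0.2], [0.1, 0.8]]
O = [[0.75, 0.25], [0.25, 0.75]]
ops = [[[T[a][b] * O[x][b] for b in range(2)] for a in range(2)]
       for x in range(2)]
inflated = [[[1.1 * v for v in row] for row in op] for op in ops]
```

Operators estimated from finite data can fail this check in the way `inflated` does, which is the situation a stability constraint would be meant to correct.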

7.1.3  Efficient  planning  in  empirically  estimated  RR-POMDPs 

Because of the difficulties associated with learning POMDPs from data, POMDP planning
is typically carried out using hand-coded POMDPs whose belief evolves in a pre-specified
state space. However, the spectral learning algorithm for RR-HMMs, together with
efficient point-based planning algorithms such as [98, 99, 100], can allow us to close the
loop by learning models, planning in them and using the resultant data to update the learnt
model. This effectively leaves the task of state space formulation to spectral learning,
which decides the optimal subspace for planning. While not as easily interpretable a priori, this
method has the potential of discovering far more compact state spaces for planning
problems based on the specific task and value function involved, especially since RR-POMDPs
are more expressive than POMDPs of the same dimension. The connection of RR-HMMs
to PSRs means that such a generalization would effectively learn models of controlled
PSRs from data. We have recently extended the spectral learning algorithm to this
scenario, with positive results; in particular, we are able to close the learning-planning loop
by carrying out planning in a model estimated from data on a simulated high-dimensional
vision-based robot navigation task [73]. Note that this is different from
Linear-Quadratic-Gaussian (LQG) control, which is the task of optimally controlling an LDS that has control
inputs.

¹ The sufficient condition, namely that the model only emits positive probabilities, is undecidable [97].
However, setting negatives to a small positive number and renormalizing works in practice.


7.1.4  Learning  RR-HMMs  over  arbitrary  observation  features 

We discussed how to learn RR-HMMs over single observations as well as stacked vectors
of multiple observations for multiple-step observable systems. The logical generalization
is to enable RR-HMMs over arbitrary indicative and characteristic features of past and
future observations. We can show that the same spectral learning algorithm yields
unbiased estimates for RR-HMM parameters over such features. Generalization of the sample
complexity bounds should in principle follow using the same sampling bounds and matrix
perturbation bounds as used here, and is an important task for future work.

7.1.5  Hilbert  space  embeddings  of  RR-HMMs 

An alternative generalization of RR-HMM expressiveness is to infinite-dimensional
feature spaces via kernels. Recent work has shown how to embed conditional distributions
in Hilbert space [101], allowing generalization of conditional distributions to
infinite-dimensional feature spaces for several applications. Since a sequential model is basically
a series of conditional distributions, [101] describes an application of conditional
distribution embedding to learning LDS models when the latent state is observed. It should
be possible to formulate a kernelized HMM or RR-HMM in a similar fashion. Further
extending the spectral learning method to learn RR-HMMs in Hilbert space would allow
HMMs over much more complex observation spaces.


7.1.6  Sample  complexity  bounds  for  spectral  learning  of  PSRs 

We  saw  in  Chapter  6  that  the  RR-HMM  learning  algorithm  returns  unbiased  estimates  of 
PSRs  as  well.  In  fact,  even  though  for  finite  data  samples  the  model  returned  may  not 
correspond  to  any  RR-HMM  of  finite  state  space  size,  it  always  corresponds  to  some  PSR 
if  it  is  stable.  What  remains  to  be  done  is  to  generalize  the  sample  complexity  bounds 
to  the  PSR  case  as  well.  Since  there  are  no  distinct  observation  and  transition  matrices 
in  PSRs,  and  the  observable  operators  and  belief  states  are  not  necessarily  stochastic,  the 
bounds  and  proofs  may  differ.  However,  the  same  basic  perturbation  and  sampling  error 
ideas  should,  in  theory,  hold. 

7.1.7  Spectral  learning  of  exponential  family  RR-HMMs 

Belief  Compression  [102]  exploits  the  stochasticity  of  HMM  beliefs  in  order  to  compress 
them  more  effectively  for  POMDP  planning  using  exponential  family  PCA  [103].  In  a 
similar  vein,  it  should  be  possible  to  extend  RR-HMMs  and  their  learning  algorithms  to 
the  exponential  family  case.  Gordon  (2002) [104]  presents  a  more  efficient  and  general 
algorithm for carrying out exponential family PCA which could be used for such an
extension.


7.2  Why  This  Thesis  Matters 

Whether the goal is to perform accurate prediction and sequence classification for
statistical data mining, or to model a partially observable dynamic environment for reasoning
under uncertainty, the modeling of dynamical systems is a central task. Due to their
theoretical properties and their effectiveness in practice, LVMs are the predominant class of
probabilistic models for representing dynamical systems. This thesis has proposed several
new algorithms for learning LVMs from data more efficiently and accurately than previous
methods, and has also proposed a novel model that combines desirable properties of
existing models and can be learned with strong performance guarantees. The thesis contributes
a number of novel ideas for addressing fundamental problems in sequential data modeling
techniques. For example, we address the problems of model selection and local minima in
EM-based HMM learning by proposing the first practical top-down HMM model selection
algorithm that discovers new states using properties of the observation distribution as well
as the temporal patterns in the data. We also address the issue of instability, proposing a
constraint-generation algorithm that outperforms previous methods in both efficiency and
accuracy. Finally, the RR-HMM is a hybrid model that combines the discrete spaces and
competitive inhibition of HMMs with the smooth state evolution of LDSs. The spectral
learning algorithm we propose for it is qualitatively different from EM-based methods,
and is based on the first known HMM learning algorithm with provable performance
guarantees. We extended the bounds in a way that makes them significantly tighter for cases
where the actual HMM rank is much lower than its state space size. This model and its
learning algorithm also blur the line of distinction between LVMs and PSRs even further.

For each of the above three branches of this thesis we demonstrated experimental
results on a wide variety of real-world data sets. We also outlined numerous promising
extensions of all three branches earlier in this chapter. The research presented in this
thesis has the potential to positively impact a variety of fields of application where
sequential data modeling is important, such as robot sensing and planning, video modeling
in computer vision, activity recognition and user modeling, speech recognition, time
series forecasting in science, bioinformatics, and more. Many of the extensions
described above are already being researched and implemented, demonstrating that this
thesis lays the groundwork for several fruitful lines of current and future research in the
field of statistical machine learning as well as in application areas such as robotics.

Appendix A


RR-HMM  Details 


This appendix contains details omitted from Chapter 6 including learning algorithm
performance proof details, a proof sketch regarding learning with ambiguous observations,
and  parameter  values  for  synthetic  HMM  experiments. 


A.1 Proofs

Proofs  for  theoretical  guarantees  on  the  RR-HMM  learning  algorithm  are  given  below, 
after  some  preliminary  conditions  and  lemmas. 

The  proof  of  Theorem  2  relies  on  Lemmas  18  and  24.  We  start  off  with  some  preliminary 
results  and  build  up  to  proving  the  main  theorem  and  its  lemmas  below. 

A remark on norms: The notation $\|X\|_p$ for matrices $X \in \mathbb{R}^{m \times n}$ denotes the operator
norm $\max_{v \neq 0} \|Xv\|_p / \|v\|_p$ for vectors $v$. Specifically, $\|X\|_2$ denotes the L2 matrix norm (also known
as spectral norm), which corresponds to the largest singular value $\sigma_1(X)$. The Frobenius norm
is denoted by $\|X\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2 \right)^{1/2}$. The notation $\|X\|_1$ for matrices denotes the
L1 matrix norm, which corresponds to the maximum absolute column sum $\max_c \sum_{i=1}^{m} |X_{ic}|$.
The definition of $\|x\|_p$ for vectors $x \in \mathbb{R}^n$ is the standard distance measure $\left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$.

A.1.1 Preliminaries


The  following  conditions  are  assumed  by  the  main  theorems  and  algorithms. 

Condition 3 [Modification of HKZ Condition 1] $\pi > 0$ element-wise, $T$ has rank $k$ (i.e.
$R$ and $S$ both have rank $k$) and $O$ has rank at least $k$.

The  following  two  conditions  on  R  can  always  be  satisfied  by  scaling  down  entries  in 
R  and  scaling  up  S  accordingly.  However  we  want  entries  in  R  to  be  as  large  as  possible 
under the two conditions below, so that $\sigma_k(U^\top O R)$ is large and $1/\sigma_k(U^\top O R)$ is small to
make  the  error  bound  as  tight  as  possible  (Theorem  2).  Hence  we  pay  for  scaling  down  R 
by  loosening  the  error  bound  we  obtain  for  a  given  number  of  training  samples. 

Condition 4 $\|R\|_1 \leq 1$.

Condition 5 For some column $1 \leq c \leq k$ of $R$, it is the case that $\|R[\cdot, c]\|_2 \leq \sqrt{k/m}$.

The  above  two  conditions  on  R  ensure  the  bounds  go  through  largely  unchanged  from 
HKZ  aside  from  the  improvement  due  to  low  rank  k.  The  first  condition  can  be  satisfied  in 
a  variety  of  ways  without  loss  of  generality,  e.g.  by  choosing  the  columns  of  R  to  be  any 
k  independent  columns  of  T,  and  S  to  be  the  coefficients  needed  to  reconstruct  T  from 
R.  Intuitively,  the  first  condition  implies  that  R  does  not  overly  magnify  the  magnitude 
of  vectors  it  multiplies  with.  The  second  one  implies  a  certain  degree  of  uniformity  in  at 
least  one  of  the  columns  of  R.  For  example,  the  uniform  distribution  in  a  column  of  R 
would  satisfy  the  constraint,  whereas  a  column  of  the  identity  matrix  would  not.  This  does 
not  imply  that  T  must  have  a  similarly  near-uniform  column.  We  can  form  R  from  the 
uniform  distribution  along  with  some  independent  columns  of  T. 
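
The uniform-column versus identity-column example can be checked numerically. The sketch below assumes that Condition 4 bounds the maximum absolute column sum of R by 1 and that Condition 5 asks for some column of R with 2-norm at most sqrt(k/m); both readings are reconstructions from the surrounding text, and the helper names and the sizes m = 4, k = 1 are hypothetical.

```python
import math

def one_norm(R):
    """Maximum absolute column sum (the L1 matrix norm)."""
    rows, cols = len(R), len(R[0])
    return max(sum(abs(R[i][c]) for i in range(rows)) for c in range(cols))

def column_two_norm(R, c):
    """Euclidean norm of column c."""
    return math.sqrt(sum(row[c] ** 2 for row in R))

def satisfies_conditions(R, tol=1e-12):
    """Return (Condition 4 holds, Condition 5 holds) for an m x k matrix R."""
    m, k = len(R), len(R[0])
    cond4 = one_norm(R) <= 1.0 + tol
    cond5 = any(column_two_norm(R, c) <= math.sqrt(k / m) + tol
                for c in range(k))
    return cond4, cond5

m = 4
uniform = [[1.0 / m] for _ in range(m)]            # a uniform column, k = 1
identity_col = [[1.0], [0.0], [0.0], [0.0]]        # a column of the identity
```

The uniform column has 2-norm 1/sqrt(m), which meets the bound with equality when k = 1, while the identity column has 2-norm 1 and fails it whenever k < m.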

The observable representation depends on a matrix $U \in \mathbb{R}^{n \times k}$ that obeys the following
condition:

Condition 6 [Modification of HKZ Condition 2] $U^\top O R$ is invertible.

This is analogous to the HKZ invertibility condition on $U^\top O$, since $OR$ is the matrix that
yields observation probabilities from a low-dimensional state vector. Hence, $U$ defines a
k-dimensional subspace that preserves the low-dimensional state dynamics regardless of
the number of states m.


Condition 7 Assume that $S \operatorname{diag}(\pi) O^\top$ has full row rank (i.e. $k$).


This condition amounts to ensuring that the ranges of $S$ and $O$, which are both at
least rank $k$, overlap enough to preserve the dynamics when mapping down to the
low-dimensional state. As in HKZ, the left singular vectors of $P_{2,1}$ give us a valid $U$ matrix.

Lemma 8 [Modification of HKZ Lemma 2] Assume Conditions 3 and 7. Then, $\operatorname{rank}(P_{2,1}) = k$.
Also, if $U$ is the matrix of left singular vectors of $P_{2,1}$ corresponding to non-zero
singular values, then $\operatorname{range}(U) = \operatorname{range}(OR)$, so $U \in \mathbb{R}^{n \times k}$ obeys Condition 6.

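Proof: From its definition, we can show $P_{2,1}$ can be written as a low-rank product of
RR-HMM parameters:

\begin{align*}
[P_{2,1}]_{i,j} &= \Pr[x_2 = i, x_1 = j] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \Pr[x_2 = i, x_1 = j, h_2 = a, h_1 = b] \quad \text{(marginalizing hidden states } h\text{)} \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \Pr[x_2 = i \mid h_2 = a] \Pr[h_2 = a \mid h_1 = b] \Pr[x_1 = j \mid h_1 = b] \Pr[h_1 = b] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} O_{ia} T_{ab} \pi_b [O^\top]_{bj} \\
\Rightarrow P_{2,1} &= O T \operatorname{diag}(\pi) O^\top \\
&= O R S \operatorname{diag}(\pi) O^\top \tag{A.1}
\end{align*}

Thus $\operatorname{range}(P_{2,1}) \subseteq \operatorname{range}(OR)$. This shows that $\operatorname{rank}(P_{2,1}) \leq \operatorname{rank}(OR)$.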
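
By Condition 7, $S \operatorname{diag}(\pi) O^\top$ has full row rank, thus $S \operatorname{diag}(\pi) O^\top (S \operatorname{diag}(\pi) O^\top)^{+} = I_{k \times k}$.
Therefore,

\[ OR = P_{2,1} (S \operatorname{diag}(\pi) O^\top)^{+} \tag{A.2} \]

which implies $\operatorname{range}(OR) \subseteq \operatorname{range}(P_{2,1})$, which in turn implies $\operatorname{rank}(OR) \leq \operatorname{rank}(P_{2,1})$.

Together this proves that $\operatorname{rank}(P_{2,1}) = \operatorname{rank}(OR)$, which we can show to be $k$ as
follows: Condition 3 implies that $\operatorname{rank}(U^\top O R) = k$, and hence $\operatorname{rank}(OR) \geq k$. Since
$OR \in \mathbb{R}^{n \times k}$, $\operatorname{rank}(OR) \leq k$. Therefore, $\operatorname{rank}(OR) = k$. Hence $\operatorname{rank}(P_{2,1}) = k$.

Since $\operatorname{range}(U) = \operatorname{range}(P_{2,1})$ by definition of singular vectors, this implies $\operatorname{range}(U) =
\operatorname{range}(OR)$. Therefore, $(U^\top O R)$ is invertible and hence $U$ obeys Condition 6. □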

The following lemma shows that the observable representation $\{b_\infty, b_1, B_1, \ldots, B_n\}$ is linearly related to the true HMM parameters, and can compute the probability of any sequence of observations.

Lemma 9 [Modification of HKZ Lemma 3] (Observable HMM Representation). Assume Condition 3 on the RR-HMM and Condition 6 on the matrix $U \in \mathbb{R}^{n \times k}$. Then, the observable representation of the RR-HMM (Definition 1 of the paper) has the following properties:

1. $b_1 = (U^\top O R)\,\pi_l = (U^\top O)\,\pi$,

2. $b_\infty^\top = 1_m^\top R\,(U^\top O R)^{-1}$,

3. For all $x = 1, \ldots, n$: $B_x = (U^\top O R)\,W_x\,(U^\top O R)^{-1}$,

4. For any time $t$: $\Pr[x_{1:t}] = b_\infty^\top B_{x_{t:1}} b_1$.

Proof:

1. We can write $P_1$ as $O\pi$, since

$$
[P_1]_i = \Pr[x_1 = i] = \sum_{a=1}^{m} \Pr[x_1 = i \mid h_1 = a]\,\Pr[h_1 = a] = \sum_{a=1}^{m} O_{ia}\,\pi_a .
$$

Combined with the fact that $b_1 = U^\top P_1$ by definition, this proves the first claim.

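2. Firstly note that $P_1^\top = 1_m^\top T\,\mathrm{diag}(\pi)\,O^\top$, since

$$
\begin{aligned}
P_1^\top &= \pi^\top O^\top \\
&= 1_m^\top\,\mathrm{diag}(\pi)\,O^\top \\
&= 1_m^\top T\,\mathrm{diag}(\pi)\,O^\top \quad (\text{since } 1_m^\top T = 1_m^\top)
\end{aligned}
$$

This allows us to write $P_1$ in the following form:

$$
\begin{aligned}
P_1^\top &= 1_m^\top T\,\mathrm{diag}(\pi)\,O^\top \\
&= 1_m^\top R\,S\,\mathrm{diag}(\pi)\,O^\top \\
&= 1_m^\top R\,(U^\top O R)^{-1}(U^\top O R)\,S\,\mathrm{diag}(\pi)\,O^\top \\
&= 1_m^\top R\,(U^\top O R)^{-1}\,U^\top P_{2,1} \quad \text{(by equation (A.1))}
\end{aligned}
$$

By equation (A.1), $U^\top P_{2,1} = (U^\top O R)\,S\,\mathrm{diag}(\pi)\,O^\top$. Since $(U^\top O R)$ is invertible by Condition 6, and $S\,\mathrm{diag}(\pi)\,O^\top$ has full row rank by Condition 7, we know that $(U^\top P_{2,1})^{+}$ exists and

$$
U^\top P_{2,1}\,(U^\top P_{2,1})^{+} = I_{k \times k} \qquad \text{(A.3)}
$$

Therefore,

$$
b_\infty^\top = P_1^\top (U^\top P_{2,1})^{+} = 1_m^\top R\,(U^\top O R)^{-1}(U^\top P_{2,1})(U^\top P_{2,1})^{+} = 1_m^\top R\,(U^\top O R)^{-1}
$$

hence proving the second claim.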
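3. The third claim can be proven by first expressing $P_{3,x,1}$ as a product of RR-HMM parameters:

$$
\begin{aligned}
[P_{3,x,1}]_{ij} &= \Pr[x_3 = i, x_2 = x, x_1 = j] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{c=1}^{m} \Pr[x_3 = i, x_2 = x, x_1 = j, h_3 = a, h_2 = b, h_1 = c] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{c=1}^{m} \Pr[x_3 = i \mid h_3 = a]\,\Pr[h_3 = a \mid h_2 = b]\,\Pr[x_2 = x \mid h_2 = b] \\
&\qquad\qquad\qquad \cdot \Pr[h_2 = b \mid h_1 = c]\,\Pr[h_1 = c]\,\Pr[x_1 = j \mid h_1 = c] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{c=1}^{m} O_{ia}\,[A_x]_{ab}\,T_{bc}\,\pi_c\,[O^\top]_{cj}
\end{aligned}
$$

$$
\Rightarrow\; P_{3,x,1} = O\,A_x\,T\,\mathrm{diag}(\pi)\,O^\top
$$

This can be transformed as follows:

$$
\begin{aligned}
P_{3,x,1} &= O\,A_x\,R\,S\,\mathrm{diag}(\pi)\,O^\top \\
&= O\,A_x\,R\,(U^\top O R)^{-1}(U^\top O R)\,S\,\mathrm{diag}(\pi)\,O^\top \\
&= O\,A_x\,R\,(U^\top O R)^{-1}\,U^\top \left(O\,T\,\mathrm{diag}(\pi)\,O^\top\right) \\
&= O\,A_x\,R\,(U^\top O R)^{-1}\,U^\top P_{2,1} \quad \text{(by equation (A.1))}
\end{aligned}
$$

and then plugging this expression into the definition of $B_x$, we obtain the required result:

$$
\begin{aligned}
B_x &= (U^\top P_{3,x,1})(U^\top P_{2,1})^{+} \quad \text{(by definition)} \\
&= (U^\top O)\,A_x\,R\,(U^\top O R)^{-1}(U^\top P_{2,1})(U^\top P_{2,1})^{+} \\
&= (U^\top O)\,A_x\,R\,(U^\top O R)^{-1} \quad \text{(by equation (A.3))} \\
&= (U^\top O R)\left(S\,\mathrm{diag}(O_{x,\cdot})\,R\right)(U^\top O R)^{-1} \\
&= (U^\top O R)\,W_x\,(U^\top O R)^{-1}
\end{aligned}
$$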
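4. Using the above three results, the fourth claim follows from equation 1 in Section 2 in the paper:

$$
\begin{aligned}
\Pr[x_1, \ldots, x_t] &= 1_m^\top R\,W_{x_t} \cdots W_{x_1}\,\pi_l \\
&= 1_m^\top R\,(U^\top O R)^{-1}(U^\top O R)\,W_{x_t}\,(U^\top O R)^{-1}(U^\top O R)\,W_{x_{t-1}}\,(U^\top O R)^{-1} \cdots \\
&\qquad \cdots (U^\top O R)\,W_{x_1}\,(U^\top O R)^{-1}(U^\top O R)\,\pi_l \\
&= b_\infty^\top B_{x_{t:1}}\,b_1
\end{aligned}
$$

□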

In addition to $b_1$ above, we define normalized conditional 'internal states' $b_t$ that help us compute conditional probabilities. These internal states are not probabilities. In contrast to HKZ where these internal states are $m$-dimensional vectors, in our case the internal states are $k$-dimensional, i.e. they correspond to the rank of the HMM. As shown above in Lemma 9,

$$
b_1 = (U^\top O R)\,\pi_l = (U^\top O)\,\pi
$$

In addition for any $t \ge 1$, given observations $x_{1:t-1}$ with non-zero probability, the internal state is defined as:

$$
b_t = b_t(x_{1:t-1}) = \frac{B_{x_{t-1:1}}\,b_1}{b_\infty^\top B_{x_{t-1:1}}\,b_1} \qquad \text{(A.4)}
$$

For $t = 1$ the formula is still consistent since $b_\infty^\top b_1 = 1_m^\top R\,(U^\top O R)^{-1}(U^\top O R)\,\pi_l = 1_m^\top R\,\pi_l = 1_m^\top \pi = 1$.


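Recall that HMM and RR-HMM parameters can be used to calculate joint probabilities as follows:

$$
\begin{aligned}
\Pr[x_1, \ldots, x_t] &= 1_m^\top A_{x_t} A_{x_{t-1}} \cdots A_{x_1}\,\pi \\
&= 1_m^\top R\,S\,\mathrm{diag}(O_{x_t,\cdot})\,R\,S\,\mathrm{diag}(O_{x_{t-1},\cdot})\,R \cdots S\,\mathrm{diag}(O_{x_1,\cdot})\,\pi \\
&= 1_m^\top R \left(S\,\mathrm{diag}(O_{x_t,\cdot})\,R\right)\left(S\,\mathrm{diag}(O_{x_{t-1},\cdot})\,R\right) \cdots \left(S\,\mathrm{diag}(O_{x_1,\cdot})\,R\right) S\pi \\
&= 1_m^\top R\,W_{x_t} \cdots W_{x_1}\,\pi_l \quad \text{(by definition of $W_x$ and $\pi_l$)} \qquad \text{(A.5)}
\end{aligned}
$$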
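As a sanity check on the factorization behind equation (A.5), the following minimal sketch compares the full-state product $1_m^\top A_{x_t} \cdots A_{x_1} \pi$ with the reduced product $1_m^\top R\,W_{x_t} \cdots W_{x_1} \pi_l$ on a toy model. The matrices `R`, `S`, `O` and the vector `pi_l` are made-up illustrative values, not taken from the thesis, and the sketch assumes $\pi = R\pi_l$ (the relation used in Lemma 9, claim 1) together with $A_x = T\,\mathrm{diag}(O_{x,\cdot})$ and $T = RS$.

```python
from itertools import product

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def diag(v):
    """Square diagonal matrix with v on the diagonal."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

# Hypothetical toy RR-HMM: m = 3 hidden states, rank k = 2, n = 2 observations.
R = [[0.6, 0.1],
     [0.3, 0.2],
     [0.1, 0.7]]           # m x k, each column sums to 1
S = [[0.5, 0.9, 0.2],
     [0.5, 0.1, 0.8]]      # k x m, each column sums to 1
O = [[0.7, 0.4, 0.1],
     [0.3, 0.6, 0.9]]      # n x m, each column sums to 1
pi_l = [0.4, 0.6]
pi = matvec(R, pi_l)       # assumption: pi = R pi_l

T = matmul(R, S)                                           # rank-k transition matrix
A = [matmul(T, diag(O[x])) for x in range(2)]              # A_x = T diag(O_x)
W = [matmul(matmul(S, diag(O[x])), R) for x in range(2)]   # W_x = S diag(O_x) R

def joint_full(xs):
    """1^T A_{x_t} ... A_{x_1} pi, with xs = (x_1, ..., x_t)."""
    v = pi
    for x in xs:
        v = matvec(A[x], v)
    return sum(v)

def joint_reduced(xs):
    """1^T R W_{x_t} ... W_{x_1} pi_l, with xs = (x_1, ..., x_t)."""
    v = pi_l
    for x in xs:
        v = matvec(W[x], v)
    return sum(matvec(R, v))
```

Under these assumptions the two functions agree on every observation sequence up to floating-point error, because $R\,W_x v = A_x (R v)$ for any $v \in \mathbb{R}^k$; the reduced recursion only ever stores a $k$-dimensional vector.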
The following lemma shows that the conditional internal states are linearly related, and also shows how we can use them to compute conditional probabilities.

Lemma 10 [Modification of HKZ Lemma 4] (Conditional Internal States) Assume the conditions of Lemma 9 hold, i.e. Conditions 3 and 6 hold. Then, for any time $t$:

1. (Recursive update) If $\Pr[x_1, \ldots, x_t] > 0$, then

$$
b_{t+1} = \frac{B_{x_t}\,b_t}{b_\infty^\top B_{x_t}\,b_t}
$$

2. (Relation to hidden states)

$$
b_t = (U^\top O R)\,l_t(x_{1:t-1}) = (U^\top O)\,h_t(x_{1:t-1})
$$

where $[h_t(x_{1:t-1})]_i = \Pr[h_t = i \mid x_{1:t-1}]$ is defined as the conditional probability of the hidden state at time $t$ given observations $x_{1:t-1}$, and $l_t(x_{1:t-1})$ is its low-dimensional projection such that $h_t(x_{1:t-1}) = R\,l_t(x_{1:t-1})$.

3. (Conditional observation probabilities)

$$
\Pr[x_t \mid x_{1:t-1}] = b_\infty^\top B_{x_t}\,b_t
$$

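Proof: The first proof is direct, the second follows by induction.

1. The $t = 2$ case $b_2 = \dfrac{B_{x_1} b_1}{b_\infty^\top B_{x_1} b_1}$ is true by definition (equation (A.4)). For $t \ge 3$, again by definition of $b_{t+1}$ we have

$$
\begin{aligned}
b_{t+1} &= \frac{B_{x_{t:1}}\,b_1}{b_\infty^\top B_{x_{t:1}}\,b_1} \\
&= \frac{B_{x_t} B_{x_{t-1:1}}\,b_1}{b_\infty^\top B_{x_{t-1:1}}\,b_1} \cdot \frac{b_\infty^\top B_{x_{t-1:1}}\,b_1}{b_\infty^\top B_{x_t} B_{x_{t-1:1}}\,b_1} \\
&= B_{x_t}\,b_t \cdot \frac{b_\infty^\top B_{x_{t-1:1}}\,b_1}{b_\infty^\top B_{x_t} B_{x_{t-1:1}}\,b_1} \quad \text{(by equation (A.4))} \\
&= \frac{B_{x_t}\,b_t}{b_\infty^\top B_{x_t}\,b_t} \quad \text{(by equation (A.4))}
\end{aligned}
$$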
2,3. The base case for claim 2 holds by Lemma 9, since $h_1 = \pi = R\,\pi_l$, $l_1 = \pi_l$ and $b_1 = (U^\top O R)\,\pi_l$. For claim 3, the base case holds since $b_\infty^\top B_{x_1} b_1 = 1_m^\top R\,W_{x_1} \pi_l$ by Lemma 9, which equals $\Pr[x_1]$ by equation (A.5). The inductive step is:


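$$
\begin{aligned}
b_{t+1} &= \frac{B_{x_t}\,b_t}{b_\infty^\top B_{x_t}\,b_t} \quad \text{(by claim 1 above)} \\
&= \frac{B_{x_t}\,(U^\top O R)\,l_t}{\Pr[x_t \mid x_{1:t-1}]} \quad \text{(by inductive hypothesis)} \\
&= \frac{(U^\top O R)\,W_{x_t}\,l_t}{\Pr[x_t \mid x_{1:t-1}]} \quad \text{(by Lemma 9)} \\
&= \frac{(U^\top O)\,A_{x_t}\,h_t}{\Pr[x_t \mid x_{1:t-1}]} \quad (\because\; R\,W_{x_t}\,l_t = R\,S\,\mathrm{diag}(O_{x_t,\cdot})\,R\,l_t = A_{x_t} h_t)
\end{aligned}
$$

Now by definition of $A_{x_t} h_t$,

$$
\begin{aligned}
b_{t+1} &= (U^\top O)\,\frac{\Pr[h_{t+1} = \cdot,\, x_t \mid x_{1:t-1}]}{\Pr[x_t \mid x_{1:t-1}]} \\
&= (U^\top O)\,\frac{\Pr[h_{t+1} = \cdot \mid x_{1:t}]\;\Pr[x_t \mid x_{1:t-1}]}{\Pr[x_t \mid x_{1:t-1}]} \\
&= (U^\top O)\,h_{t+1}(x_{1:t}) \\
&= (U^\top O R)\,l_{t+1}(x_{1:t})
\end{aligned}
$$

This proves claim 2, using which we can complete the proof for claim 3:

$$
\begin{aligned}
b_\infty^\top B_{x_{t+1}}\,b_{t+1} &= 1_m^\top R\,(U^\top O R)^{-1}(U^\top O R)\,W_{x_{t+1}}\,(U^\top O R)^{-1}\,b_{t+1} \quad \text{(by Lemma 9)} \\
&= 1_m^\top R\,W_{x_{t+1}}\,(U^\top O R)^{-1}(U^\top O R)\,l_{t+1} \quad \text{(by claim 2 above)} \\
&= 1_m^\top R\,W_{x_{t+1}}\,l_{t+1} \\
&= 1_m^\top R\,S\,\mathrm{diag}(O_{x_{t+1},\cdot})\,R\,l_{t+1} \quad \text{(by definition of $W_{x_{t+1}}$)} \\
&= 1_m^\top T\,\mathrm{diag}(O_{x_{t+1},\cdot})\,h_{t+1} \\
&= 1_m^\top A_{x_{t+1}}\,h_{t+1}
\end{aligned}
$$

Again by definition of $A_{x_{t+1}} h_{t+1}$,

$$
\begin{aligned}
b_\infty^\top B_{x_{t+1}}\,b_{t+1} &= \sum_{a=1}^{m} \sum_{b=1}^{m} \Pr[x_{t+1} \mid h_{t+1} = a]\,\Pr[h_{t+1} = a \mid h_t = b]\,\Pr[h_t = b \mid x_{1:t}] \\
&= \sum_{a=1}^{m} \sum_{b=1}^{m} \Pr[x_{t+1},\, h_{t+1} = a,\, h_t = b \mid x_{1:t}] \\
&= \Pr[x_{t+1} \mid x_{1:t}]
\end{aligned}
$$

□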

Remark 11 If $U$ is the matrix of left singular vectors of $P_{2,1}$ corresponding to non-zero singular values, then $U$ is the observable-representation analogue of the observation probability matrix $O$ in the sense that, given a conditional state $b_t$, $\Pr[x_t = i \mid x_{1:t-1}] = [U b_t]_i$ in the same way as $\Pr[x_t = i \mid x_{1:t-1}] = [O h_t]_i$ for a conditional hidden state $h_t$.

Proof: Since $\mathrm{range}(U) = \mathrm{range}(OR)$ (Lemma 2), and $U U^\top$ is a projection operator to $\mathrm{range}(U)$, we have $U U^\top O R = O R$, so $U b_t = U (U^\top O R)\,l_t = O R\,l_t = O h_t$. □


A.1.2 Matrix Perturbation Theory

We take a diversion to matrix perturbation theory and state some standard theorems from Stewart and Sun (1990) [105] and Wedin (1972) [106] which we will use, and also prove a result from these theorems. The following lemma bounds the L2-norm difference between the pseudoinverse of a matrix and the pseudoinverse of its perturbation.


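Lemma 12 (Theorem 3.8 of Stewart and Sun (1990) [105]) Let $A \in \mathbb{R}^{m \times n}$, with $m \ge n$, and let $\tilde{A} = A + E$. Then,

$$
\left\| \tilde{A}^{+} - A^{+} \right\|_2 \le \frac{1 + \sqrt{5}}{2} \cdot \max\left\{ \left\| A^{+} \right\|_2^2,\; \left\| \tilde{A}^{+} \right\|_2^2 \right\} \left\| E \right\|_2 .
$$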


The following lemma bounds the absolute differences between the singular values of a matrix and its perturbation.

Lemma 13 (Theorem 4.11 of Stewart and Sun (1990) [105]). Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, and let $\tilde{A} = A + E$. If the singular values of $A$ and $\tilde{A}$ are $(\sigma_1 \ge \ldots \ge \sigma_n)$ and $(\tilde{\sigma}_1 \ge \ldots \ge \tilde{\sigma}_n)$, respectively, then

$$
|\tilde{\sigma}_i - \sigma_i| \le \|E\|_2, \qquad i = 1, \ldots, n
$$
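To make the singular value perturbation bound of Lemma 13 concrete, here is a small numerical sketch restricted to real $2 \times 2$ matrices, where singular values have a closed form via the eigenvalues of $M^\top M$. The matrices `A` and `E` are arbitrary illustrative values, and the helper `singular_values_2x2` is a hypothetical name introduced only for this example.

```python
import math

def singular_values_2x2(M):
    """Singular values (descending) of a real 2x2 matrix via eigenvalues of M^T M."""
    (a, b), (c, d) = M
    # Entries of the symmetric Gram matrix G = M^T M.
    g11 = a * a + c * c
    g22 = b * b + d * d
    g12 = a * b + c * d
    tr = g11 + g22
    det = g11 * g22 - g12 * g12
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam1 = (tr + disc) / 2.0
    lam2 = max((tr - disc) / 2.0, 0.0)   # clamp tiny negative rounding error
    return math.sqrt(lam1), math.sqrt(lam2)

A = [[3.0, 1.0], [0.5, 2.0]]
E = [[0.05, -0.02], [0.01, 0.03]]        # small perturbation
A_tilde = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]

sigma = singular_values_2x2(A)
sigma_tilde = singular_values_2x2(A_tilde)
norm_E = singular_values_2x2(E)[0]       # spectral norm = largest singular value

# Lemma 13 states that each of these gaps is at most norm_E.
gaps = [abs(st - s) for st, s in zip(sigma_tilde, sigma)]
```

Each entry of `gaps` is the quantity $|\tilde{\sigma}_i - \sigma_i|$ from the lemma, and `norm_E` is $\|E\|_2$; for larger matrices a numerical SVD routine would take the place of the closed form.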

Before the next lemma we must define the notion of canonical angles between two subspaces:

Definition 14 (Adapted from definition 4.35 of Stewart (1998) [107]) Let $X$ and $Y$ be matrices whose columns comprise orthonormal bases of two $p$-dimensional subspaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Let the singular values of $X^{H} Y$ (where $X^{H}$ denotes the conjugate transpose, or Hermitian, of matrix $X$) be $\gamma_1, \gamma_2, \ldots, \gamma_p$. Then the canonical angles $\theta_i$ between $\mathcal{X}$ and $\mathcal{Y}$ are defined by

$$
\theta_i = \cos^{-1} \gamma_i, \qquad i = 1, 2, \ldots, p
$$

The matrix of canonical angles $\Theta$ is defined as

$$
\Theta(\mathcal{X}, \mathcal{Y}) = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_p)
$$

Note that $\forall i\;\; \gamma_i \in [0, 1]$ in the above definition, since $\gamma_1$ (assuming it's the highest singular value) is no greater than $\sigma_1(X^{H})\,\sigma_1(Y) \le 1 \cdot 1 = 1$, and hence $\cos^{-1} \gamma_i$ is always well-defined.

For any matrix $A$, define $A_\perp$ to be the orthogonal complement of the subspace spanned by the columns of $A$. For example, any subset of left singular vectors of a matrix comprises the orthogonal complement of the matrix composed of the remaining left singular vectors. The following lemma gives us a convenient way of calculating the sines of the canonical angles between two subspaces using orthogonal complements:

Lemma 15 (Theorem 4.37 of Stewart (1998) [107]) Let $X$ and $Y$ be $n \times p$ matrices, with $n > p$, whose columns comprise orthonormal bases of two $p$-dimensional subspaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Assume $X_\perp, Y_\perp \in \mathbb{R}^{n \times (n-p)}$ such that $[X \;\, X_\perp]$ and $[Y \;\, Y_\perp]$ are orthogonal matrices. The singular values of $Y_\perp^\top X$ are the sines of the canonical angles between $\mathcal{X}$ and $\mathcal{Y}$.

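The following lemma bounds the L2-norm difference between the sine of the canonical angle matrices of the range of a matrix and its perturbation.

Lemma 16 ([106], Theorem 4.4 of Stewart and Sun (1990) [105]). Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, with the singular value decomposition $(U_1, U_2, U_3, \Sigma_1, \Sigma_2, V_1, V_2)$:

$$
\begin{bmatrix} U_1^\top \\ U_2^\top \\ U_3^\top \end{bmatrix} A \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix}
$$

Let $\tilde{A} = A + E$, with analogous SVD $(\tilde{U}_1, \tilde{U}_2, \tilde{U}_3, \tilde{\Sigma}_1, \tilde{\Sigma}_2, \tilde{V}_1, \tilde{V}_2)$. Let $\Phi$ be the matrix of canonical angles between $\mathrm{range}(U_1)$ and $\mathrm{range}(\tilde{U}_1)$, and $\Theta$ be the matrix of canonical angles between $\mathrm{range}(V_1)$ and $\mathrm{range}(\tilde{V}_1)$. If there exists $\delta > 0$, $\alpha \ge 0$ such that $\min \sigma(\tilde{\Sigma}_1) \ge \alpha + \delta$ and $\max \sigma(\Sigma_2) \le \alpha$, then

$$
\max\left\{ \|\sin \Phi\|_2,\; \|\sin \Theta\|_2 \right\} \le \frac{\|E\|_2}{\delta}
$$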

The above two lemmas can be adapted to prove that Corollary 22 of HKZ holds for the low-rank case as well, assuming that the perturbation is bounded by a number less than $\sigma_k$. The following lemma shows that (1) the $k$th singular value of a matrix and its perturbation are close to each other, and (2) that the subspace spanned by the first $k$ singular vectors of a matrix is nearly orthogonal to the subspace spanned by the $(k+1)$th, $\ldots$, $m$th singular vectors of its perturbation, with the matrix product of their bases being bounded.

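Corollary 17 [Modification of HKZ Corollary 22] Let $A \in \mathbb{R}^{m \times n}$, with $m \ge n$, have rank $k \le n$, and let $U \in \mathbb{R}^{m \times k}$ be the matrix of $k$ left singular vectors corresponding to the non-zero singular values $\sigma_1 \ge \ldots \ge \sigma_k > 0$ of $A$. Let $\tilde{A} = A + E$. Let $\tilde{U} \in \mathbb{R}^{m \times k}$ be the matrix of $k$ left singular vectors corresponding to the largest $k$ singular values $\tilde{\sigma}_1 \ge \ldots \ge \tilde{\sigma}_k$ of $\tilde{A}$, and let $\tilde{U}_\perp \in \mathbb{R}^{m \times (m-k)}$ be the remaining left singular vectors. Assume $\|E\|_2 \le \epsilon \sigma_k$ for some $\epsilon < 1$. Then:

1. $\tilde{\sigma}_k \ge (1 - \epsilon)\,\sigma_k$.

2. $\left\| \tilde{U}_\perp^\top U \right\|_2 \le \|E\|_2 / \tilde{\sigma}_k$.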

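Proof:

1. From Lemma 13,

$$
\begin{aligned}
|\tilde{\sigma}_k - \sigma_k| &\le \|E\|_2 \le \epsilon \sigma_k \\
\Rightarrow\; \tilde{\sigma}_k - \sigma_k &\ge -\epsilon \sigma_k \\
\Rightarrow\; \tilde{\sigma}_k &\ge (1 - \epsilon)\,\sigma_k
\end{aligned}
$$

which proves the first claim.

2. Recall that by Lemma 15, if $\Phi$ is a matrix of all canonical angles between $\mathrm{range}(U)$ and $\mathrm{range}(\tilde{U})$, then $\sin \Phi$ contains all the singular values of $\tilde{U}_\perp^\top U$ along its diagonal.

Also recall that the L2 norm of a matrix is its top singular value. Then,

$$
\begin{aligned}
\|\sin \Phi\|_2 &= \sigma_1(\sin \Phi) \quad \text{(by definition)} \\
&= \max \mathrm{diag}(\sin \Phi) \quad \text{(since $\sin \Phi$ is a diagonal matrix)} \\
&= \sigma_1(\tilde{U}_\perp^\top U) \quad \text{(by Lemma 15)} \\
&= \left\| \tilde{U}_\perp^\top U \right\|_2 \quad \text{(by definition)}
\end{aligned}
$$

Invoking Lemma 16 with the parameter values $\delta = \tilde{\sigma}_k$ and $\alpha = 0$ yields $\|\sin \Phi\|_2 \le \|E\|_2 / \tilde{\sigma}_k$. Combining this with $\|\sin \Phi\|_2 = \left\| \tilde{U}_\perp^\top U \right\|_2$ proves claim 2.

□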


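A.1.3 Supporting Lemmas

In this section we develop the main supporting lemmas that help us prove Theorem 2.

Estimation Errors

We define $\epsilon_1$, $\epsilon_{2,1}$ and $\epsilon_{3,x,1}$ as sampling errors for $P_1$, $P_{2,1}$ and $P_{3,x,1}$ respectively:

$$
\begin{aligned}
\epsilon_1 &= \left\| \hat{P}_1 - P_1 \right\|_F & \text{(A.6a)} \\
\epsilon_{2,1} &= \left\| \hat{P}_{2,1} - P_{2,1} \right\|_F & \text{(A.6b)} \\
\epsilon_{3,x,1} &= \left\| \hat{P}_{3,x,1} - P_{3,x,1} \right\|_F \quad \text{for } x = 1, \ldots, n & \text{(A.6c)}
\end{aligned}
$$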


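Lemma 18 [Modification of HKZ Lemma 8] If the algorithm independently samples $N$ observation triples from the HMM, then with probability at least $1 - \eta$:

$$
\begin{aligned}
\epsilon_1 &\le \sqrt{\frac{1}{N} \ln \frac{3}{\eta}} + \sqrt{\frac{1}{N}} \\
\epsilon_{2,1} &\le \sqrt{\frac{1}{N} \ln \frac{3}{\eta}} + \sqrt{\frac{1}{N}} \\
\max_x \epsilon_{3,x,1} &\le \sqrt{\frac{1}{N} \ln \frac{3}{\eta}} + \sqrt{\frac{1}{N}} \\
\sum_x \epsilon_{3,x,1} &\le \min_k \left( \sqrt{\frac{k}{N} \ln \frac{3}{\eta}} + \sqrt{\frac{k}{N}} + 2\epsilon(k) \right) + \sqrt{\frac{1}{N} \ln \frac{3}{\eta}} + \sqrt{\frac{1}{N}}
\end{aligned}
$$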

Before proving this lemma, we need some definitions and a preliminary result. First, we restate McDiarmid's Inequality [108]:

Theorem 19 Let $Z_1, \ldots, Z_m$ be independent random variables all taking values in the set $\mathcal{Z}$. Let $c_i$ be some positive real numbers. Further, let $f : \mathcal{Z}^m \to \mathbb{R}$ be a function of $Z_1, \ldots, Z_m$ that satisfies $\forall i$, $\forall z_1, \ldots, z_m, z_i' \in \mathcal{Z}$,

$$
|f(z_1, \ldots, z_i, \ldots, z_m) - f(z_1, \ldots, z_i', \ldots, z_m)| \le c_i .
$$

Then for all $\epsilon > 0$,

$$
\Pr[f - \mathbb{E}[f] \ge \epsilon] \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \right).
$$
Assume $z$ is a discrete random variable that takes on values in $1, \ldots, d$. The goal is to estimate the vector $q = [\Pr(z = j)]_{j=1}^{d}$ from $N$ i.i.d. samples $z_i$ ($i = 1, \ldots, N$). Let $e_j$ denote the $j$th column of the $d \times d$ identity matrix. For $i = 1, \ldots, N$, suppose $q_i$ is a column of the $d \times d$ identity matrix such that $q_i = e_{z_i}$. In other words, the $z_i$th component of $q_i$ is 1 and the rest are 0. Then the empirical estimate of $q$ in terms of $q_i$ is $\hat{q} = \sum_{i=1}^{N} q_i / N$.

Each part of Lemma 18 corresponds to bounding, for some $q$, the quantity $\|\hat{q} - q\|_2$.

We first state a result based on McDiarmid's inequality (Theorem 19):

Proposition 20 [Modification of HKZ Proposition 19] For all $\epsilon > 0$ and $\hat{q}$, $q$ and $N$ as defined above:

$$
\Pr\left( \|\hat{q} - q\|_2 \ge 1/\sqrt{N} + \epsilon \right) \le e^{-N\epsilon^2}
$$

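Proof: Recall $\hat{q} = \sum_{i=1}^{N} q_i / N$, and define $\hat{p} = \sum_{i=1}^{N} p_i / N$ where $p_i = q_i$ except for $i = k$, and $p_k$ is an arbitrary column of the appropriate-sized identity matrix. Then we have

$$
\begin{aligned}
\left|\, \|\hat{q} - q\|_2 - \|\hat{p} - q\|_2 \,\right| &\le \|\hat{q} - \hat{p}\|_2 \quad \text{(by triangle inequality)} \\
&= \left\| \sum_{i} (q_i - p_i)/N \right\|_2 \\
&= (1/N)\,\|q_k - p_k\|_2 \quad \text{(by definition of $\hat{p}$, $\hat{q}$ and L2-norm)} \\
&\le (1/N)\sqrt{1^2 + 1^2} \\
&= \sqrt{2}/N
\end{aligned}
$$

This shows that $\|\hat{q} - q\|_2$ is a function of random variables $q_1, \ldots, q_N$ such that changing the $k$th random variable $q_k$ for any $1 \le k \le N$ (resulting in $\|\hat{p} - q\|_2$) changes the value of the function by at most $c_k = \sqrt{2}/N$. Note that $q$ is not a random variable but rather the variable we are trying to estimate. In this case, McDiarmid's inequality (Theorem 19) bounds the deviation $\|\hat{q} - q\|_2$ from its expectation $\mathbb{E}\|\hat{q} - q\|_2$ as:

$$
\begin{aligned}
\Pr\left( \|\hat{q} - q\|_2 \ge \mathbb{E}\|\hat{q} - q\|_2 + \epsilon \right) &\le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{N} c_i^2} \right) \\
&= \exp\left( \frac{-2\epsilon^2}{N \cdot 2/N^2} \right) \\
&= e^{-N\epsilon^2} \qquad \text{(A.7)}
\end{aligned}
$$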


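We can bound the expected value using the following inequality:

$$
\begin{aligned}
\mathbb{E}\left\| \sum_{i=1}^{N} q_i - Nq \right\|_2 &= \mathbb{E}\left[ \left( \left\| \sum_{i=1}^{N} q_i - Nq \right\|_2^2 \right)^{1/2} \right] \\
&\le \left( \mathbb{E}\left\| \sum_{i=1}^{N} q_i - Nq \right\|_2^2 \right)^{1/2} \quad \text{(by concavity of square root, and Jensen's inequality)} \\
&\le \left( \sum_{i=1}^{N} \mathbb{E}\,\|q_i - q\|_2^2 \right)^{1/2} \\
&= \left( \sum_{i=1}^{N} \mathbb{E}\,(q_i - q)^\top (q_i - q) \right)^{1/2}
\end{aligned}
$$

Multiplying out and using linearity of expectation and properties of $q_i$ (namely, that $q_i^\top q_i = 1$, $\mathbb{E}(q_i) = q$ and $q$ is constant), we get:

$$
\begin{aligned}
\mathbb{E}\left\| \sum_{i=1}^{N} q_i - Nq \right\|_2 &\le \left( \sum_{i=1}^{N} \mathbb{E}\left( 1 - 2 q_i^\top q + \|q\|_2^2 \right) \right)^{1/2} \quad (\text{since } q_i^\top q_i = 1) \\
&= \left( \sum_{i=1}^{N} \mathbb{E}(1) - 2 \sum_{i=1}^{N} \mathbb{E}(q_i^\top q) + \sum_{i=1}^{N} \mathbb{E}\,\|q\|_2^2 \right)^{1/2} \\
&= \left( N - 2N\|q\|_2^2 + N\|q\|_2^2 \right)^{1/2} \\
&= \sqrt{N\left(1 - \|q\|_2^2\right)}
\end{aligned}
$$

This implies an upper bound on the expected value:

$$
\begin{aligned}
\mathbb{E}\,\|\hat{q} - q\|_2^2 &= (1/N^2)\;\mathbb{E}\left\| \sum_{i=1}^{N} q_i - Nq \right\|_2^2 \\
&\le (1/N^2) \cdot N\left(1 - \|q\|_2^2\right) \\
\Rightarrow\; \mathbb{E}\,\|\hat{q} - q\|_2 &\le (1/\sqrt{N})\sqrt{1 - \|q\|_2^2} \\
&\le 1/\sqrt{N}
\end{aligned}
$$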


Using this upper bound in McDiarmid's inequality (equation (A.7)), we get a looser version of the bound that proves the proposition:

$$
\Pr\left( \|\hat{q} - q\|_2 \ge 1/\sqrt{N} + \epsilon \right) \le e^{-N\epsilon^2}
$$

□
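The concentration result can be illustrated with a small simulation. The sketch below draws $N$ i.i.d. samples from a made-up distribution `q`, forms the empirical estimate, and compares its L2 error with the threshold $1/\sqrt{N} + \sqrt{\ln(1/\eta)/N}$ obtained by setting the failure probability $e^{-N\epsilon^2}$ equal to $\eta$. The distribution, the seed and the values of `N`, `eta` and `trials` are arbitrary choices made for illustration.

```python
import math
import random

def empirical_l2_error(q, N, rng):
    """Draw N i.i.d. samples from q and return ||q_hat - q||_2."""
    counts = [0] * len(q)
    for z in rng.choices(range(len(q)), weights=q, k=N):
        counts[z] += 1
    return math.sqrt(sum((c / N - p) ** 2 for c, p in zip(counts, q)))

def deviation_bound(N, eta):
    """Threshold 1/sqrt(N) + sqrt(ln(1/eta)/N), exceeded with probability at most eta."""
    return 1.0 / math.sqrt(N) + math.sqrt(math.log(1.0 / eta) / N)

rng = random.Random(0)                 # fixed seed so the run is repeatable
q = [0.5, 0.2, 0.15, 0.1, 0.05]        # hypothetical true distribution
N, eta, trials = 2000, 1e-3, 200

errors = [empirical_l2_error(q, N, rng) for _ in range(trials)]
violations = sum(err > deviation_bound(N, eta) for err in errors)
```

In a run like this the observed errors sit well below the threshold, which reflects how loose the bound is: the typical error is on the order of $\sqrt{(1 - \|q\|_2^2)/N}$, while the threshold adds the extra $\sqrt{\ln(1/\eta)/N}$ term.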

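We are now ready to prove Lemma 18.

Proof: [Lemma 18] We will treat $P_1$, $P_{2,1}$ and $P_{3,x,1}$ as vectors, and use McDiarmid's inequality to bound the error in estimating a distribution over a simplex based on indicator vector samples, using Proposition 20. We know that

$$
\Pr\left( \|\hat{q} - q\|_2 \ge \sqrt{1/N} + \epsilon \right) \le e^{-N\epsilon^2} .
$$

Now let $\eta = e^{-N\epsilon^2}$. This implies

$$
\begin{aligned}
\ln \eta &= -N\epsilon^2 \\
\ln(1/\eta) &= N\epsilon^2 \\
\epsilon &= \sqrt{\ln(1/\eta)/N}
\end{aligned}
$$

Hence,

$$
\Pr\left( \|\hat{q} - q\|_2 \ge \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \right) \le \eta
$$

Therefore, with probability at least $1 - \eta$,

$$
\|\hat{q} - q\|_2 \le 1/\sqrt{N} + \sqrt{\ln(1/\eta)/N} \qquad \text{(A.8)}
$$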
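Now, in place of $q$ in equation (A.8), we substitute the stochastic vector $P_1$ to prove the first claim, the vectorized version of the stochastic matrix $P_{2,1}$ to prove the second claim, and the vectorized version of the stochastic tensor $P_{3,2,1} \in \mathbb{R}^{n \times n \times n}$ obtained by stacking $P_{3,x,1}$ matrices over all $x$, to prove the third claim. The matrices $\hat{P}_{3,x,1}$ are stacked accordingly to obtain the estimated tensor $\hat{P}_{3,2,1}$. We get the following:

$$
\begin{aligned}
\epsilon_1 &\le 1/\sqrt{N} + \sqrt{\ln(1/\eta)/N} \quad \text{(hence proving the first claim)} \\
\epsilon_{2,1} &\le 1/\sqrt{N} + \sqrt{\ln(1/\eta)/N} \quad \text{(hence proving the second claim)} \\
\max_x \epsilon_{3,x,1} &\le \sqrt{\sum_x \epsilon_{3,x,1}^2} \\
&\le \sqrt{\sum_x \left\| \hat{P}_{3,x,1} - P_{3,x,1} \right\|_F^2} \\
&= \sqrt{\sum_x \sum_i \sum_j \left( [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right)^2} \\
&= \left\| \hat{P}_{3,2,1} - P_{3,2,1} \right\|_F \\
&\le \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \quad \text{(hence proving the third claim)}
\end{aligned}
$$

Note the following useful inequality from the above proof:

$$
\sqrt{\sum_x \epsilon_{3,x,1}^2} \le \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \qquad \text{(A.9)}
$$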


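It remains to prove the fourth claim, regarding $\sum_x \epsilon_{3,x,1}$. First we get a bound that depends on $n$ as follows:

$$
\begin{aligned}
\sum_x \epsilon_{3,x,1} &= \sum_x |\epsilon_{3,x,1}| \quad (\because\; \forall x,\; \epsilon_{3,x,1} \ge 0) \\
&\le \sqrt{n}\,\sqrt{\sum_x \epsilon_{3,x,1}^2} \quad (\because\; \forall x \in \mathbb{R}^a,\; \|x\|_1 \le \sqrt{a}\,\|x\|_2) \\
&\le \sqrt{n/N} + \sqrt{\frac{n}{N} \ln \frac{1}{\eta}}
\end{aligned}
$$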
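We aren't going to use the above bound. Instead, if $n$ is large and $N$ small, this bound can be improved by removing direct dependence on $n$. Let $\epsilon(k)$ be the sum of smallest $n - k$ probabilities of the second observation $x_2$. Let $S_k$ be the set of these $n - k$ such observations $x$, for any $k$. Therefore,

$$
\epsilon(k) = \sum_{x \in S_k} \Pr[x_2 = x] = \sum_{x \in S_k} \sum_{i,j} [P_{3,x,1}]_{ij} .
$$

Now, first note that we can bound $\sum_{x \notin S_k} \epsilon_{3,x,1}$ as follows:

$$
\begin{aligned}
\sum_{x \notin S_k} \epsilon_{3,x,1} &\le \sum_{x \notin S_k} |\epsilon_{3,x,1}| \\
&\le \sqrt{k}\,\sqrt{\sum_{x \notin S_k} \epsilon_{3,x,1}^2} \quad (\because\; \forall x \in \mathbb{R}^a,\; \|x\|_1 \le \sqrt{a}\,\|x\|_2)
\end{aligned}
$$

By combining with equation (A.9), we get

$$
\sum_{x \notin S_k} \epsilon_{3,x,1} \le \sqrt{k/N} + \sqrt{k \ln(1/\eta)/N} \qquad \text{(A.10)}
$$

To bound $\sum_{x \in S_k} \epsilon_{3,x,1}$ we first apply equation (A.8) again. Consider the vector $q$ of length $kn^2 + 1$ whose first $kn^2$ entries comprise the elements of $P_{3,x,1}$ for all $x \notin S_k$, and whose last entry is the cumulative sum of elements of $P_{3,x,1}$ for all $x \in S_k$. Define $\hat{q}$ accordingly with $\hat{P}_{3,x,1}$ instead of $P_{3,x,1}$. Now equation (A.8) directly gives us with probability at least $1 - \eta$:

$$
\begin{aligned}
\left[ \sum_{x \notin S_k} \left\| \hat{P}_{3,x,1} - P_{3,x,1} \right\|_F^2 + \left| \sum_{x \in S_k} \sum_{i,j} [\hat{P}_{3,x,1}]_{ij} - \sum_{x \in S_k} \sum_{i,j} [P_{3,x,1}]_{ij} \right|^2 \right]^{1/2} &\le \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \\
\Rightarrow\; \sum_{x \notin S_k} \left\| \hat{P}_{3,x,1} - P_{3,x,1} \right\|_F^2 + \left| \sum_{x \in S_k} \sum_{i,j} \left( [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) \right|^2 &\le \left( \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \right)^2
\end{aligned}
$$

Since the first term above is positive, we get

$$
\left| \sum_{x \in S_k} \sum_{i,j} \left( [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) \right| \le \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} \qquad \text{(A.11)}
$$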
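Now, by definition of $S_k$,

$$
\begin{aligned}
\sum_{x \in S_k} \epsilon_{3,x,1} &\le \sum_{x \in S_k} \left\| \hat{P}_{3,x,1} - P_{3,x,1} \right\|_F \\
&\le \sum_{x \in S_k} \sum_{i,j} \left| [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right| \quad (\because\; \forall x,\; \|x\|_2 \le \|x\|_1) \\
&= \sum_{x \in S_k} \sum_{i,j} \max\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) - \sum_{x \in S_k} \sum_{i,j} \min\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) \\
&\qquad (\because\; \forall x,\; |x| = \max(0, x) - \min(0, x)) \\
&\le \sum_{x \in S_k} \sum_{i,j} \max\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) + \sum_{x \in S_k} \sum_{i,j} \min\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) + 2 \sum_{x \in S_k} \sum_{i,j} [P_{3,x,1}]_{ij} \\
&= \sum_{x \in S_k} \sum_{i,j} \max\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) + \epsilon(k) \\
&\qquad + \sum_{x \in S_k} \sum_{i,j} \min\left(0,\, [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij}\right) + \epsilon(k) \quad \text{(by definition of $\epsilon(k)$)} \\
&\le \left| \sum_{x \in S_k} \sum_{i,j} \left( [\hat{P}_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) \right| + 2\epsilon(k)
\end{aligned}
$$

Plugging in equation (A.11), we get a bound on $\sum_{x \in S_k} \epsilon_{3,x,1}$:

$$
\sum_{x \in S_k} \epsilon_{3,x,1} \le \sqrt{1/N} + \sqrt{\ln(1/\eta)/N} + 2\epsilon(k)
$$

Combining with equation (A.10) and noting that $k$ is arbitrary, we get the desired bound:

$$
\sum_x \epsilon_{3,x,1} \le \min_k \left[ \sqrt{k \ln(1/\eta)/N} + \sqrt{k/N} + \sqrt{\ln(1/\eta)/N} + \sqrt{1/N} + 2\epsilon(k) \right]
$$

Note that, to get the term $\ln(3/\eta)$ instead of $\ln(1/\eta)$ as in the fourth claim, we simply use $\eta/3$ instead of $\eta$. This bound on $\sum_x \epsilon_{3,x,1}$ will be small if the number of frequently occurring observations is small, even if $n$ itself is large. □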

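The next lemma uses the perturbation bound in Corollary 17 to bound the effect of sampling error on the estimate $\hat{U}$, and on the conditioning of $(\hat{U}^\top O R)$.

Lemma 21 [Modification of HKZ Lemma 9] Suppose $\epsilon_{2,1} \le \epsilon \cdot \sigma_k(P_{2,1})$ for some $\epsilon < 1/2$. Let $\epsilon_0 = \epsilon_{2,1}^2 / \left((1 - \epsilon)\,\sigma_k(P_{2,1})\right)^2$. Define $U, \hat{U} \in \mathbb{R}^{n \times k}$ as the matrices of the first $k$ left singular vectors of $P_{2,1}$, $\hat{P}_{2,1}$ respectively. Let $\theta_1, \ldots, \theta_k$ be the canonical angles between $\mathrm{span}(U)$ and $\mathrm{span}(\hat{U})$. Then:

1. $\epsilon_0 < 1$

2. $\sigma_k(\hat{U}^\top \hat{P}_{2,1}) \ge (1 - \epsilon)\,\sigma_k(P_{2,1})$

3. $\sigma_k(\hat{U}^\top P_{2,1}) \ge \sqrt{1 - \epsilon_0}\;\sigma_k(P_{2,1})$

4. $\sigma_k(\hat{U}^\top O R) \ge \sqrt{1 - \epsilon_0}\;\sigma_k(OR)$

Proof: First some additional definitions and notation. Define $U_\perp$ to be the remaining $n - k$ left singular vectors of $P_{2,1}$ corresponding to the lower $n - k$ singular values, and correspondingly $\hat{U}_\perp$ for $\hat{P}_{2,1}$. Suppose $U \Sigma V^\top = P_{2,1}$ is the thin SVD of $P_{2,1}$. Finally, we use the notation $\nu_i\{A\} \in \mathbb{R}^q$ to denote the $i$th right singular vector of a matrix $A \in \mathbb{R}^{p \times q}$. Recall that $\sigma_i(A) = \|A\,\nu_i\{A\}\|_2$ by definition.

First claim: $\epsilon_0 < 1$ follows from the assumptions:

$$
\begin{aligned}
\epsilon_0 &= \frac{\epsilon_{2,1}^2}{\left((1 - \epsilon)\,\sigma_k(P_{2,1})\right)^2} \\
&\le \frac{\epsilon^2 \sigma_k(P_{2,1})^2}{(1 - \epsilon)^2 \sigma_k(P_{2,1})^2} \\
&= \frac{\epsilon^2}{(1 - \epsilon)^2} \\
&< 1 \quad (\text{since } \epsilon < 1/2)
\end{aligned}
$$

Second claim: By Corollary 17, $\sigma_k(\hat{P}_{2,1}) \ge (1 - \epsilon)\,\sigma_k(P_{2,1})$. The second claim follows from noting that $\sigma_k(\hat{U}^\top \hat{P}_{2,1}) = \sigma_k(\hat{P}_{2,1})$.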
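Third and fourth claims: First consider the $k$th singular value of $\hat{U}^\top U$. For any vector $x \in \mathbb{R}^k$:

$$
\begin{aligned}
\frac{\left\| \hat{U}^\top U x \right\|_2}{\|x\|_2} &\ge \min_y \frac{\left\| \hat{U}^\top U y \right\|_2}{\|y\|_2} \\
&= \sigma_k(\hat{U}^\top U) \quad \text{(by definition of smallest singular value)} \\
&= \cos(\theta_k) \quad \text{(by Definition 14)} \\
&= \sqrt{1 - \sin^2(\theta_k)} \\
&= \sqrt{1 - \sigma_k(\hat{U}_\perp^\top U)^2} \quad \text{(by Lemma 15)} \\
&\ge \sqrt{1 - \left\| \hat{U}_\perp^\top U \right\|_2^2} \quad \text{(by definition of L2 matrix norm)}
\end{aligned}
$$

Therefore,

$$
\left\| \hat{U}^\top U x \right\|_2 \ge \|x\|_2 \sqrt{1 - \left\| \hat{U}_\perp^\top U \right\|_2^2} \qquad \text{(A.12)}
$$

Note that

$$
\begin{aligned}
\left\| \hat{U}_\perp^\top U \right\|_2^2 &\le \epsilon_{2,1}^2 / \sigma_k(\hat{P}_{2,1})^2 \quad \text{(by Corollary 17)} \\
&\le \frac{\epsilon_{2,1}^2}{(1 - \epsilon)^2 \sigma_k(P_{2,1})^2} \quad (\text{since } 0 \le \epsilon < 1/2) \\
&= \epsilon_0 \quad \text{(by definition)}
\end{aligned}
$$

Hence, by combining the above with equation (A.12), since $0 \le \epsilon_0 < 1$:

$$
\left\| \hat{U}^\top U x \right\|_2 \ge \|x\|_2 \sqrt{1 - \epsilon_0} \quad (\text{for all } x \in \mathbb{R}^k) \qquad \text{(A.13)}
$$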

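The remaining claims follow by taking different choices of $x$ in equation (A.13), and by using the intuition that the smallest singular value of a matrix is the smallest possible L2 norm of a unit-length vector after the matrix has left-multiplied that vector, and the particular vector for which this holds is the corresponding right singular vector. For claim 3, let $x = \Sigma V^\top \nu_k\{\hat{U}^\top P_{2,1}\}$. Then by equation (A.13):

$$
\left\| \hat{U}^\top U \Sigma V^\top \nu_k\{\hat{U}^\top P_{2,1}\} \right\|_2 \ge \left\| \Sigma V^\top \nu_k\{\hat{U}^\top P_{2,1}\} \right\|_2 \sqrt{1 - \epsilon_0}
$$

Since $P_{2,1} = U \Sigma V^\top$, and $\left\| \Sigma V^\top \nu_k\{\Sigma V^\top\} \right\|_2 \le \left\| \Sigma V^\top \nu_k\{\hat{U}^\top P_{2,1}\} \right\|_2$ by definition of $\nu_k\{\Sigma V^\top\}$, we have:

$$
\begin{aligned}
\sigma_k(\hat{U}^\top P_{2,1}) &\ge \sigma_k(\Sigma V^\top)\sqrt{1 - \epsilon_0} \quad \text{(by definition of $\sigma_k(\hat{U}^\top P_{2,1})$, $\sigma_k(\Sigma V^\top)$)} \\
\sigma_k(\hat{U}^\top P_{2,1}) &\ge \sigma_k(P_{2,1})\sqrt{1 - \epsilon_0} \quad (\because\; \sigma_k(\Sigma V^\top) = \sigma_k(P_{2,1}))
\end{aligned}
$$

which proves claim 3.

For claim 4, first recall that $OR$ can be exactly expressed as $P_{2,1}\left(S\,\mathrm{diag}(\pi)\,O^\top\right)^{+}$ (equation (A.2)). For brevity, let $A = \left(S\,\mathrm{diag}(\pi)\,O^\top\right)^{+}$, so that $OR = P_{2,1} A$. Then, let $x = \Sigma V^\top A\,\nu_k\{\hat{U}^\top O R\}$ in equation (A.13):

$$
\left\| \hat{U}^\top U \Sigma V^\top A\,\nu_k\{\hat{U}^\top O R\} \right\|_2 \ge \left\| \Sigma V^\top A\,\nu_k\{\hat{U}^\top O R\} \right\|_2 \sqrt{1 - \epsilon_0}
$$

Since $P_{2,1} = U \Sigma V^\top$, and $\left\| \Sigma V^\top A\,\nu_k\{\Sigma V^\top A\} \right\|_2 \le \left\| \Sigma V^\top A\,\nu_k\{\hat{U}^\top O R\} \right\|_2$ by definition of $\nu_k\{\Sigma V^\top A\}$, we get:

$$
\begin{aligned}
\left\| \hat{U}^\top P_{2,1} A\,\nu_k\{\hat{U}^\top O R\} \right\|_2 &\ge \left\| \Sigma V^\top A\,\nu_k\{\Sigma V^\top A\} \right\|_2 \sqrt{1 - \epsilon_0} \\
\left\| \hat{U}^\top O R\,\nu_k\{\hat{U}^\top O R\} \right\|_2 &\ge \sigma_k(\Sigma V^\top A)\sqrt{1 - \epsilon_0} \quad \text{(by equation (A.2))}
\end{aligned}
$$

By definition of $\sigma_k(\hat{U}^\top O R)$, $\sigma_k(\Sigma V^\top A)$, we see that

$$
\begin{aligned}
\sigma_k(\hat{U}^\top O R) &\ge \sigma_k(\Sigma V^\top A)\sqrt{1 - \epsilon_0} \\
\sigma_k(\hat{U}^\top O R) &\ge \sigma_k(OR)\sqrt{1 - \epsilon_0} \quad (\because\; \sigma_k(\Sigma V^\top A) = \sigma_k(P_{2,1} A) = \sigma_k(OR))
\end{aligned}
$$

hence proving claim 4. □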

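Define the following observable representation using $U = \hat{U}$, which constitutes a true observable representation for the HMM as long as $(\hat{U}^\top O R)$ is invertible:

$$
\begin{aligned}
\tilde{b}_\infty &= (P_{2,1}^\top \hat{U})^{+} P_1 = (\hat{U}^\top O R)^{-\top} R^\top 1_m \\
\tilde{B}_x &= (\hat{U}^\top P_{3,x,1})(\hat{U}^\top P_{2,1})^{+} = (\hat{U}^\top O R)\,W_x\,(\hat{U}^\top O R)^{-1} \quad \text{for } x = 1, \ldots, n \\
\tilde{b}_1 &= \hat{U}^\top P_1
\end{aligned}
$$

Define the following error measures of estimated parameters with respect to the true observable representation. The error vector in $\delta_1$ is projected to $\mathbb{R}^m$ before applying the vector norm, for convenience in later theorems.

$$
\begin{aligned}
\delta_\infty &= \left\| (\hat{U}^\top O)^\top (\hat{b}_\infty - \tilde{b}_\infty) \right\|_\infty \\
\Delta_x &= \left\| (\hat{U}^\top O R)^{-1} (\hat{B}_x - \tilde{B}_x)(\hat{U}^\top O R) \right\|_1 = \left\| (\hat{U}^\top O R)^{-1} \hat{B}_x (\hat{U}^\top O R) - W_x \right\|_1 \\
\Delta &= \sum_x \Delta_x \\
\delta_1 &= \left\| R\,(\hat{U}^\top O R)^{-1} (\hat{b}_1 - \tilde{b}_1) \right\|_1 = \left\| R\,(\hat{U}^\top O R)^{-1} \hat{b}_1 - \pi \right\|_1
\end{aligned}
$$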

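The next lemma proves that the estimated parameters $\hat{b}_\infty$, $\hat{B}_x$, $\hat{b}_1$ are close to the true parameters $\tilde{b}_\infty$, $\tilde{B}_x$, $\tilde{b}_1$ if the sampling errors $\epsilon_1$, $\epsilon_{2,1}$, $\epsilon_{3,x,1}$ are small:

Lemma 22 [Modification of HKZ Lemma 10] Assume $\epsilon_{2,1} \le \sigma_k(P_{2,1})/3$. Then:

$$
\begin{aligned}
\delta_\infty &\le 4 \cdot \left( \frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\epsilon_1}{3\sigma_k(P_{2,1})} \right) \\
\Delta_x &\le \frac{8}{\sqrt{3}} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \left( \Pr[x_2 = x] \cdot \frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\epsilon_{3,x,1}}{3\sigma_k(P_{2,1})} \right) \\
\Delta &\le \frac{8}{\sqrt{3}} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \left( \frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\sum_x \epsilon_{3,x,1}}{3\sigma_k(P_{2,1})} \right) \\
\delta_1 &\le \frac{2}{\sqrt{3}} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \epsilon_1
\end{aligned}
$$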

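Proof: Note that the assumption on $\epsilon_{2,1}$ guarantees $(\hat{U}^\top O R)$ to be invertible by Lemma 21, claim 4.

$\delta_\infty$ bound: We first see that $\delta_\infty$ can be bounded by

$$
\begin{aligned}
\delta_\infty &= \left\| (O^\top \hat{U})(\hat{b}_\infty - \tilde{b}_\infty) \right\|_\infty \\
&\le \left\| O^\top \right\|_\infty \left\| \hat{U} (\hat{b}_\infty - \tilde{b}_\infty) \right\|_\infty \\
&\le \left\| \hat{U} (\hat{b}_\infty - \tilde{b}_\infty) \right\|_\infty \\
&\le \left\| \hat{U} (\hat{b}_\infty - \tilde{b}_\infty) \right\|_2 \\
&\le \left\| \hat{b}_\infty - \tilde{b}_\infty \right\|_2
\end{aligned}
$$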
2 
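In turn, this leads to the following expression:

$$
\begin{aligned}
\left\| \hat{b}_\infty - \tilde{b}_\infty \right\|_2 &= \left\| (\hat{P}_{2,1}^\top \hat{U})^{+} \hat{P}_1 - (P_{2,1}^\top \hat{U})^{+} P_1 \right\|_2 \\
&= \left\| (\hat{P}_{2,1}^\top \hat{U})^{+} \hat{P}_1 - (P_{2,1}^\top \hat{U})^{+} \hat{P}_1 + (P_{2,1}^\top \hat{U})^{+} \hat{P}_1 - (P_{2,1}^\top \hat{U})^{+} P_1 \right\|_2 \\
&= \left\| \left( (\hat{P}_{2,1}^\top \hat{U})^{+} - (P_{2,1}^\top \hat{U})^{+} \right) \hat{P}_1 + (P_{2,1}^\top \hat{U})^{+} (\hat{P}_1 - P_1) \right\|_2 \\
&\le \left\| \left( (\hat{P}_{2,1}^\top \hat{U})^{+} - (P_{2,1}^\top \hat{U})^{+} \right) \hat{P}_1 \right\|_2 + \left\| (P_{2,1}^\top \hat{U})^{+} (\hat{P}_1 - P_1) \right\|_2 \\
&\le \left\| (\hat{P}_{2,1}^\top \hat{U})^{+} - (P_{2,1}^\top \hat{U})^{+} \right\|_2 \left\| \hat{P}_1 \right\|_1 + \left\| (P_{2,1}^\top \hat{U})^{+} \right\|_2 \left\| \hat{P}_1 - P_1 \right\|_2
\end{aligned}
$$

The last step above obtains from the consistency of the L2 matrix norm with L1 vector norm, and from the definition of L2 matrix norm (spectral norm) as $\|A\|_2 = \max_x \|Ax\|_2 / \|x\|_2$.

Now, recall that $\hat{U}$ has orthonormal columns, and hence multiplying a matrix with $\hat{U}$ cannot increase its spectral norm. Hence,

$$
\left\| \hat{P}_{2,1}^\top \hat{U} - P_{2,1}^\top \hat{U} \right\|_2 = \left\| (\hat{P}_{2,1}^\top - P_{2,1}^\top)\,\hat{U} \right\|_2 \le \left\| \hat{P}_{2,1}^\top - P_{2,1}^\top \right\|_2 \le \epsilon_{2,1} .
$$

So, we can use Lemma 12 to bound the L2-distance between pseudoinverses of $\hat{P}_{2,1}^\top \hat{U}$ and $P_{2,1}^\top \hat{U}$ using $\epsilon_{2,1}$ as an upper bound on the difference between the matrices themselves. Also recall that singular values of the pseudoinverse of a matrix are the reciprocals of the matrix singular values. Substituting this in the above expression, along with the facts that $\sigma_k(\hat{P}_{2,1}^\top \hat{U}) = \sigma_k(\hat{P}_{2,1})$, $\|\hat{P}_1\|_1 = 1$ and $\|\hat{P}_1 - P_1\|_2 = \epsilon_1$, gives us:

$$
\left\| \hat{b}_\infty - \tilde{b}_\infty \right\|_2 \le \frac{1 + \sqrt{5}}{2} \cdot \frac{\epsilon_{2,1}}{\min\left( \sigma_k(\hat{P}_{2,1}),\, \sigma_k(P_{2,1}^\top \hat{U}) \right)^2} + \frac{\epsilon_1}{\sigma_k(P_{2,1}^\top \hat{U})}
$$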

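Now, to simplify the last expression further, consider Lemma 21 in the above context. Here, $\epsilon_{2,1} \le \sigma_k(P_{2,1})/3$ and hence $\epsilon = 1/3$. Therefore $\sigma_k(\hat{P}_{2,1}) = \sigma_k(\hat{U}^\top \hat{P}_{2,1}) \ge (2/3)\,\sigma_k(P_{2,1})$ and $\sigma_k(\hat{U}^\top P_{2,1}) \ge \sqrt{1 - \epsilon_0}\;\sigma_k(P_{2,1})$. Hence

$$
\min\left( \sigma_k(\hat{P}_{2,1}),\, \sigma_k(P_{2,1}^\top \hat{U}) \right)^2 \ge \sigma_k(P_{2,1})^2 \cdot \min\left( 2/3,\, \sqrt{1 - \epsilon_0} \right)^2
$$

The latter term is larger since

$$
\begin{aligned}
\epsilon_0 &= \frac{\epsilon_{2,1}^2}{\left((1 - \epsilon)\,\sigma_k(P_{2,1})\right)^2} \\
&\le \frac{\sigma_k(P_{2,1})^2 / 9}{4\sigma_k(P_{2,1})^2 / 9} \\
&= 1/4 \\
\Rightarrow\; \sqrt{1 - \epsilon_0} &\ge \sqrt{3}/2 > 2/3
\end{aligned}
$$

Therefore $\min\left( \sigma_k(\hat{P}_{2,1}),\, \sigma_k(P_{2,1}^\top \hat{U}) \right)^2 \ge \sigma_k(P_{2,1})^2 (2/3)^2$. Plugging this into the expression above along with the fact that $\sigma_k(\hat{U}^\top P_{2,1}) \ge (\sqrt{3}/2)\,\sigma_k(P_{2,1})$, we prove the required result for $\delta_\infty$:

$$
\begin{aligned}
\delta_\infty &\le \frac{1 + \sqrt{5}}{2} \cdot \frac{9\epsilon_{2,1}}{4\sigma_k(P_{2,1})^2} + \frac{2\epsilon_1}{\sqrt{3}\,\sigma_k(P_{2,1})} \\
&\le 4 \cdot \left( \frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\epsilon_1}{3\sigma_k(P_{2,1})} \right)
\end{aligned}
$$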

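$\Delta_x$, $\Delta$ bounds: We first bound each term $\Delta_x$ by $\left( \sqrt{k} / \sigma_k(\hat{U}^\top O R) \right) \left\| \hat{B}_x - \tilde{B}_x \right\|_2$:

$$
\begin{aligned}
\Delta_x &= \left\| (\hat{U}^\top O R)^{-1} (\hat{B}_x - \tilde{B}_x)(\hat{U}^\top O R) \right\|_1 \\
&\le \left\| (\hat{U}^\top O R)^{-1} (\hat{B}_x - \tilde{B}_x)\,\hat{U}^\top \right\|_1 \|OR\|_1 \quad \text{(by norm consistency)} \\
&\le \sqrt{k}\, \left\| (\hat{U}^\top O R)^{-1} (\hat{B}_x - \tilde{B}_x)\,\hat{U}^\top \right\|_2 \|O\|_1 \|R\|_1 \quad \text{(by L1 vs. L2 norm inequality)} \\
&\le \sqrt{k}\, \left\| (\hat{U}^\top O R)^{-1} \right\|_2 \left\| \hat{B}_x - \tilde{B}_x \right\|_2 \left\| \hat{U}^\top \right\|_2 \|O\|_1 \|R\|_1 \quad \text{(by norm consistency)} \\
&\le \sqrt{k}\, \left\| (\hat{U}^\top O R)^{-1} \right\|_2 \left\| \hat{B}_x - \tilde{B}_x \right\|_2 \quad (\because\; \|\hat{U}^\top\|_2,\, \|O\|_1,\, \|R\|_1 \le 1) \\
&= \sqrt{k}\, \left\| \hat{B}_x - \tilde{B}_x \right\|_2 / \sigma_k(\hat{U}^\top O R) \quad (\because\; \sigma_{\max}\left((\hat{U}^\top O R)^{-1}\right) = 1/\sigma_{\min}(\hat{U}^\top O R))
\end{aligned}
$$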
2 
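The term $\left\| \hat{B}_x - \tilde{B}_x \right\|_2$ in the numerator can be bounded by

$$
\begin{aligned}
\left\| \hat{B}_x - \tilde{B}_x \right\|_2 &= \left\| (\hat{U}^\top \hat{P}_{3,x,1})(\hat{U}^\top \hat{P}_{2,1})^{+} - (\hat{U}^\top P_{3,x,1})(\hat{U}^\top P_{2,1})^{+} \right\|_2 \\
&\le \left\| (\hat{U}^\top P_{3,x,1}) \left( (\hat{U}^\top \hat{P}_{2,1})^{+} - (\hat{U}^\top P_{2,1})^{+} \right) \right\|_2 + \left\| \hat{U}^\top \left( \hat{P}_{3,x,1} - P_{3,x,1} \right) (\hat{U}^\top \hat{P}_{2,1})^{+} \right\|_2 \\
&\le \left\| P_{3,x,1} \right\|_2 \cdot \frac{1 + \sqrt{5}}{2} \cdot \frac{\epsilon_{2,1}}{\min\left( \sigma_k(\hat{P}_{2,1}),\, \sigma_k(\hat{U}^\top P_{2,1}) \right)^2} + \frac{\epsilon_{3,x,1}}{\sigma_k(\hat{U}^\top \hat{P}_{2,1})} \\
&\le \Pr[x_2 = x] \cdot \frac{1 + \sqrt{5}}{2} \cdot \frac{\epsilon_{2,1}}{\min\left( \sigma_k(\hat{P}_{2,1}),\, \sigma_k(\hat{U}^\top P_{2,1}) \right)^2} + \frac{\epsilon_{3,x,1}}{\sigma_k(\hat{U}^\top \hat{P}_{2,1})}
\end{aligned}
$$

where the second inequality is from Lemma 12 and the last one uses the fact that

$$
\left\| P_{3,x,1} \right\|_2 \le \left\| P_{3,x,1} \right\|_F = \sqrt{\sum_{i,j} [P_{3,x,1}]_{i,j}^2} \le \sum_{i,j} [P_{3,x,1}]_{i,j} = \Pr[x_2 = x] .
$$

Applying Lemma 21 as in the $\delta_\infty$ bound above, gives us the required result on $\Delta_x$. Summing both sides over $x$ results in the required bound on $\Delta$.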


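$\delta_1$ bound: For $\delta_1$ we invoke Condition 4 to use the fact that $\|R\|_1 \le 1$. Specifically,

$$
\begin{aligned}
\delta_1 &= \left\| R\,(\hat{U}^\top O R)^{-1} \hat{U}^\top (\hat{P}_1 - P_1) \right\|_1 \\
&\le \|R\|_1 \left\| (\hat{U}^\top O R)^{-1} \hat{U}^\top (\hat{P}_1 - P_1) \right\|_1 \quad \text{(norm consistency)} \\
&\le \sqrt{k}\,\|R\|_1 \left\| (\hat{U}^\top O R)^{-1} \hat{U}^\top (\hat{P}_1 - P_1) \right\|_2 \quad (\|x\|_1 \le \sqrt{n}\,\|x\|_2 \text{ for any } x \in \mathbb{R}^n) \\
&\le \sqrt{k}\,\|R\|_1 \left\| (\hat{U}^\top O R)^{-1} \right\|_2 \left\| \hat{U}^\top \right\|_2 \left\| \hat{P}_1 - P_1 \right\|_2 \quad \text{(norm consistency)} \\
&\le \sqrt{k}\,\|R\|_1 \left\| (\hat{U}^\top O R)^{-1} \right\|_2 \cdot \epsilon_1 \quad \text{(defn. of $\epsilon_1$, $\hat{U}$ has orthonormal columns)} \\
&\le \frac{\sqrt{k}\,\epsilon_1}{\sigma_k(\hat{U}^\top O R)} \quad (\because\; \|R\|_1 \le 1, \text{ defn. of L2-norm})
\end{aligned}
$$

The desired bound on $\delta_1$ is obtained by using Lemma 21. With $\epsilon$, $\epsilon_0$ as described in the above proof for $\delta_\infty$, we have that $\sigma_k(\hat{U}^\top O R) \ge (\sqrt{3}/2)\,\sigma_k(OR)$. The required bound follows by plugging this inequality into the above upper bound for $\delta_1$. □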
A.1.4 Proof of Theorem 2


The following Lemmas 23 and 24, together with Lemmas 18, 21 and 22 above, constitute the proof of Theorem 2 on joint probability accuracy. We state the results based on appropriate modifications of HKZ, and provide complete proofs. We also describe how the proofs generalize to the case of handling continuous observations using Kernel Density Estimation (KDE). First, define the following as in HKZ:

\[
\epsilon(i) = \min\left\{ \sum_{j \in S} \Pr[x_2 = j] \;:\; S \subseteq \{1 \ldots n\},\ |S| = n - i \right\}
\]

and let

\[
n_0(\epsilon) = \min\{ i : \epsilon(i) \le \epsilon \}
\]

The term $n_0(\epsilon)$, which occurs in the theorem statement, can be interpreted as the minimum number of discrete observations that accounts for $1 - \epsilon$ of the total marginal observation probability mass. Since this can be much lower than (and independent of) $n$ in many applications, the analysis of HKZ is able to use $n_0$ instead of $n$ in the sample complexity bound. This is useful in domains with large $n$, and our relaxation of HKZ preserves this advantageous property.
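As an illustration of the two definitions above, here is a minimal sketch that computes $\epsilon(i)$ and $n_0(\epsilon)$ from a marginal observation distribution. The function names and the example distribution are hypothetical, chosen only for this sketch; the subset $S$ of size $n - i$ with the smallest mass is obtained by dropping the $i$ most probable observations.

```python
def eps_of(marginal, i):
    """epsilon(i): the smallest total mass of any n - i observations,
    i.e. what is left after removing the i most probable observations."""
    ordered = sorted(marginal)               # ascending probabilities
    return sum(ordered[: len(marginal) - i])


def n0(marginal, eps):
    """n_0(eps): the smallest i such that epsilon(i) <= eps."""
    for i in range(len(marginal) + 1):
        if eps_of(marginal, i) <= eps:
            return i
    raise ValueError("eps must be non-negative")


# Hypothetical marginal Pr[x_2 = j] over n = 4 observations.
marginal = [0.5, 0.3, 0.15, 0.05]
print(n0(marginal, 0.25))   # the two most likely observations suffice
print(n0(marginal, 0.10))   # three observations are needed
```

When most of the mass sits on a few observations, `n0` stays small even if `len(marginal)` is large, which is the property the sample complexity bound exploits.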

The following lemma quantifies how estimation errors accumulate while computing the joint probability of a length-$t$ sequence, due to errors in $\hat B_x$ and $\hat b_1$.


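Lemma 23 [Modification of HKZ Lemma 11] Assume $\hat U^\top OR$ is invertible. For any time $t$:

\[
\sum_{x_{1:t}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\|_1 \le (1+\Delta)^t\,\delta_1 + (1+\Delta)^t - 1
\]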


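Proof: Proof by induction. The base case for $t = 0$, i.e. that

\[
\|R(\hat U^\top OR)^{-1}(\hat b_1 - \tilde b_1)\|_1 \le \delta_1
\]

is true by definition of $\delta_1$. For the rest, define unnormalized states $\hat b_t = \hat b_t(x_{1:t-1}) = \hat B_{x_{t-1:1}}\hat b_1$ and $\tilde b_t = \tilde b_t(x_{1:t-1}) = \tilde B_{x_{t-1:1}}\tilde b_1$. For some particular $t > 1$, assume the inductive hypothesis as follows:

\[
\sum_{x_{1:t-1}} \|R(\hat U^\top OR)^{-1}(\hat b_t - \tilde b_t)\|_1 \le (1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1
\]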


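The sum over $x_{1:t}$ in the LHS can be decomposed as:

\[
\begin{aligned}
&\sum_{x_{1:t}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\|_1 \\
&\quad= \sum_{x_t}\sum_{x_{1:t-1}} \|R(\hat U^\top OR)^{-1}\big((\hat B_{x_t} - \tilde B_{x_t})\tilde b_t + (\hat B_{x_t} - \tilde B_{x_t})(\hat b_t - \tilde b_t) + \tilde B_{x_t}(\hat b_t - \tilde b_t)\big)\|_1
\end{aligned}
\]

Using the triangle inequality, the above sum is bounded by

\[
\begin{aligned}
&\sum_{x_t}\sum_{x_{1:t-1}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_t} - \tilde B_{x_t})(\hat U^\top O)\|_1\,\|R(\hat U^\top OR)^{-1}\tilde b_t\|_1 \\
&\quad+ \sum_{x_t}\sum_{x_{1:t-1}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_t} - \tilde B_{x_t})(\hat U^\top O)\|_1\,\|R(\hat U^\top OR)^{-1}(\hat b_t - \tilde b_t)\|_1 \\
&\quad+ \sum_{x_t}\sum_{x_{1:t-1}} \|R(\hat U^\top OR)^{-1}\tilde B_{x_t}(\hat U^\top OR)(\hat U^\top OR)^{-1}(\hat b_t - \tilde b_t)\|_1
\end{aligned}
\]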


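Each of the above double sums is bounded separately. For the first, we note that $\|R(\hat U^\top OR)^{-1}\tilde b_t\|_1 = \Pr[x_{1:t-1}]$, which sums to 1 over $x_{1:t-1}$. The remainder of the double sum is bounded by $\Delta$, by definition. For the second double sum, the inner sum over $\|R(\hat U^\top OR)^{-1}(\hat b_t - \tilde b_t)\|_1$ is bounded using the inductive hypothesis. The outer sum scales this bound by $\Delta$, by definition. Hence the second double sum is bounded by $\Delta((1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1)$. Finally, we deal with the third double sum as follows. We first replace $(\hat U^\top OR)^{-1}\tilde B_{x_t}(\hat U^\top OR)$ by $W_{x_t}$, and note that $R \cdot W_{x_t} = A_{x_t}R$. Since $A_{x_t}$ is entry-wise nonnegative by definition, $\|A_{x_t}v\|_1 \le \mathbf{1}^\top A_{x_t}|v|$, where $|v|$ denotes element-wise absolute value. Also note that $\sum_{x_t}\mathbf{1}^\top A_{x_t}|v| = \mathbf{1}^\top T|v| = \mathbf{1}^\top|v| = \|v\|_1$. Using this result with $v = R(\hat U^\top OR)^{-1}(\hat b_t - \tilde b_t)$ in the third double sum above, the inductive hypothesis bounds the double sum by $(1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1$. Combining these three bounds gives us the required result:

\[
\begin{aligned}
&\sum_{x_{1:t}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\|_1 \\
&\quad\le \Delta + \Delta\big((1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1\big) + (1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1 \\
&\quad= \Delta + (1+\Delta)\big((1+\Delta)^{t-1}\delta_1 + (1+\Delta)^{t-1} - 1\big) \\
&\quad= \Delta + (1+\Delta)^{t}\delta_1 + (1+\Delta)^{t} - 1 - \Delta \\
&\quad= (1+\Delta)^{t}\delta_1 + (1+\Delta)^{t} - 1
\end{aligned}
\]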

thus completing the induction. □
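The induction can be checked numerically. The sketch below (function names are hypothetical) applies the three-term bound from the inductive step repeatedly, starting from $\delta_1$, and compares the result with the closed form $(1+\Delta)^t\delta_1 + (1+\Delta)^t - 1$.

```python
import math


def closed_form(delta, delta1, t):
    """Right-hand side of Lemma 23 for a sequence of length t."""
    return (1 + delta) ** t * delta1 + (1 + delta) ** t - 1


def inductive_step(delta, prev_bound):
    """One step of the induction: the first double sum contributes delta,
    the second delta * prev_bound, and the third prev_bound."""
    return delta + delta * prev_bound + prev_bound


def iterate(delta, delta1, t):
    bound = delta1                      # base case, t = 0
    for _ in range(t):
        bound = inductive_step(delta, bound)
    return bound


# Arbitrary illustrative values for Delta and delta_1.
for t in (1, 5, 20):
    assert math.isclose(iterate(0.03, 0.05, t), closed_form(0.03, 0.05, t))
```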

The following lemma bounds the effect of errors in the normalizer $\hat b_\infty$.


Lemma 24 [Modification of HKZ Lemma 12] Assume $\epsilon_{2,1} \le \sigma_k(P_{2,1})/3$. Then for any $t$,

\[
\sum_{x_{1:t}} \big|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big| \le (1+\delta_\infty)(1+\delta_1)(1+\Delta)^t - 1
\]


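Proof: First note that the upper bound on $\epsilon_{2,1}$, along with Lemma 21, ensures that $\sigma_k(\hat U^\top OR) > 0$ and so $(\hat U^\top OR)$ is invertible. The LHS above can be decomposed into three sums that are dealt with separately:

\[
\begin{aligned}
\sum_{x_{1:t}} \big|\widehat{\Pr}[x_{1:t}] - \Pr[x_{1:t}]\big| &= \sum_{x_{1:t}} \big|\hat b_\infty^\top \hat B_{x_{t:1}}\hat b_1 - \tilde b_\infty^\top \tilde B_{x_{t:1}}\tilde b_1\big| \\
&\le \sum_{x_{1:t}} \big|(\hat b_\infty - \tilde b_\infty)^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}\tilde B_{x_{t:1}}\tilde b_1\big| \\
&\quad+ \sum_{x_{1:t}} \big|(\hat b_\infty - \tilde b_\infty)^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\big| \\
&\quad+ \sum_{x_{1:t}} \big|\tilde b_\infty^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\big|
\end{aligned}
\]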


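The first sum can be bounded as follows, using Hölder's inequality and bounds from Lemma 22:

\[
\begin{aligned}
\sum_{x_{1:t}} \big|(\hat b_\infty - \tilde b_\infty)^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}\tilde B_{x_{t:1}}\tilde b_1\big| &\le \sum_{x_{1:t}} \|(\hat U^\top O)^\top(\hat b_\infty - \tilde b_\infty)\|_\infty\,\|R(\hat U^\top OR)^{-1}\tilde B_{x_{t:1}}\tilde b_1\|_1 \\
&\le \sum_{x_{1:t}} \delta_\infty\,\|A_{x_{t:1}}\pi\|_1 \\
&= \sum_{x_{1:t}} \delta_\infty \Pr[x_{1:t}] \\
&= \delta_\infty
\end{aligned}
\]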

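The second sum can also be bounded using Hölder's inequality, as well as the bound in Lemma 23:

\[
\begin{aligned}
&\sum_{x_{1:t}} \big|(\hat b_\infty - \tilde b_\infty)^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\big| \\
&\quad\le \sum_{x_{1:t}} \|(\hat U^\top O)^\top(\hat b_\infty - \tilde b_\infty)\|_\infty\,\|R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\|_1 \\
&\quad\le \delta_\infty\big((1+\Delta)^t\delta_1 + (1+\Delta)^t - 1\big)
\end{aligned}
\]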


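The third sum again uses Lemma 23:

\[
\begin{aligned}
\sum_{x_{1:t}} \big|\tilde b_\infty^\top(\hat U^\top OR)(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\big| &= \sum_{x_{1:t}} \big|\mathbf{1}^\top R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\big| \\
&\le \sum_{x_{1:t}} \|R(\hat U^\top OR)^{-1}(\hat B_{x_{t:1}}\hat b_1 - \tilde B_{x_{t:1}}\tilde b_1)\|_1 \\
&\le (1+\Delta)^t\delta_1 + (1+\Delta)^t - 1
\end{aligned}
\]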


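Adding these three sums gives us:

\[
\begin{aligned}
\sum_{x_{1:t}} \big|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big| &\le \delta_\infty + \delta_\infty\big((1+\Delta)^t\delta_1 + (1+\Delta)^t - 1\big) + (1+\Delta)^t\delta_1 + (1+\Delta)^t - 1 \\
&= \delta_\infty + (1+\delta_\infty)\big((1+\Delta)^t\delta_1 + (1+\Delta)^t - 1\big) \\
&= (1+\delta_\infty)(1+\delta_1)(1+\Delta)^t - 1
\end{aligned}
\]

which is the required bound. □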

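Proof (Theorem 2): Assume $N$ and $\varepsilon$ as in the theorem statement:

\[
\varepsilon = \sigma_k(OR)\,\sigma_k(P_{2,1})\,\epsilon / (4t\sqrt{k})
\]

\[
N \ge C \cdot \frac{t^2}{\epsilon^2} \cdot \left( \frac{k}{\sigma_k(OR)^2\,\sigma_k(P_{2,1})^4} + \frac{k \cdot n_0(\varepsilon)}{\sigma_k(OR)^2\,\sigma_k(P_{2,1})^2} \right) \cdot \log(1/\eta)
\]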


First note that

\[
\sum_{x_{1:t}} \big|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big| \le 2
\]

since it is the $L_1$ difference between two stochastic vectors. Therefore, the theorem is vacuous for $\epsilon \ge 2$. Hence we can assume

\[
\epsilon < 1
\]

in the proof and let the constant $C$ absorb the factor 4 difference due to the $1/\epsilon^2$ term in the expression for $N$.

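The proof has three steps. We first list these steps, then prove them below.

First step: for a suitable constant $C$, the following sampling error bounds follow from Lemma 18:

\[
\begin{aligned}
\epsilon_1 &\le \min\Big(.05 \cdot (3/8) \cdot \sigma_k(P_{2,1}) \cdot \epsilon,\ .05 \cdot (\sqrt{3}/2) \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && \text{(A.14a)} \\
\epsilon_{2,1} &\le \min\Big(.05 \cdot (1/8) \cdot \sigma_k(P_{2,1})^2 \cdot (\epsilon/5),\ .01 \cdot (\sqrt{3}/8) \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot (1/(t\sqrt{k})) \cdot \epsilon\Big) && \text{(A.14b)} \\
\sum_x \epsilon_{3,x,1} &\le 0.39 \cdot (3\sqrt{3}/8) \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1}) \cdot (1/(t\sqrt{k})) \cdot \epsilon && \text{(A.14c)}
\end{aligned}
\]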

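Second step: Lemma 22 together with equations (A.14) imply:

\[
\begin{aligned}
\delta_\infty &\le .05\epsilon && \text{(A.15a)} \\
\delta_1 &\le .05\epsilon && \text{(A.15b)} \\
\Delta &\le 0.4\epsilon/t && \text{(A.15c)}
\end{aligned}
\]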

Third step: By Lemma 24, equations (A.15) and the inequality

\[
(1 + (a/t))^t \le 1 + 2a \quad \text{for } a \le 1/2 \qquad \text{(A.16)}
\]

we get the theorem statement.
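Inequality (A.16) follows from $(1 + a/t)^t \le e^a \le 1 + 2a$ for $0 \le a \le 1/2$. A minimal numerical sanity check over a grid of values (the helper names are hypothetical, and the grid is an arbitrary choice for this sketch):

```python
def compound(a, t):
    """Left-hand side of (A.16)."""
    return (1.0 + a / t) ** t


def holds(a, t):
    """True when (1 + a/t)^t <= 1 + 2a."""
    return compound(a, t) <= 1.0 + 2.0 * a


# The inequality holds on a grid of a in [0, 1/2] and t in 1..200.
grid_ok = all(holds(i / 100.0, t) for i in range(0, 51) for t in range(1, 201))
print(grid_ok)

# The restriction on a matters: for large a the inequality can fail.
print(holds(2.0, 100))
```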

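Proof of first step: Note that for any value of the matrix $P_{2,1}$, we can upper-bound $\sigma_k(P_{2,1})$ by 1:

\[
\begin{aligned}
\sigma_k(P_{2,1}) \le \sigma_1(P_{2,1}) &= \max_{\|x\|_2 = 1} \|P_{2,1}x\|_2 \\
&= \max_{\|x\|_2 = 1} \left( \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} [P_{2,1}]_{ij}\,x_j \Big)^2 \right)^{1/2} \\
&\le \max_{\|x\|_2 = 1} \sum_{i=1}^{n} \Big| \sum_{j=1}^{n} [P_{2,1}]_{ij}\,x_j \Big| && \text{(by norm inequality)} \\
&\le \sum_{i=1}^{n}\sum_{j=1}^{n} \big|[P_{2,1}]_{ij}\big| && (|x_j| \le 1 \text{ since } \|x\|_2 = 1) \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} [P_{2,1}]_{ij} && \text{(by non-negativity of } P_{2,1}) \\
&= 1 && \text{(by definition)}
\end{aligned}
\]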


Similarly, for any column-stochastic observation probability matrix $O$ we can bound $\sigma_k(OR)$ by $\sqrt{k}$. First see that $\sigma_1(O) \le \sqrt{m}$:


(Ti(0)  =  max  ||0a;| 


=  max 


j=i  i=i 

m  n  \ 

2 


1/2 


Elia'S, 


\x\\^  =  1  ^  \Xi\  <  1) 


vi=i  *=i 


1/2 


< 


(by  triangle  inequality) 


m  n 

j=l  i=l 

1/2 

<  I  ^  ]  (by  definition  of  O) 

v/  =  l 


=  vm 


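Now the bound on $\sigma_k(OR)$ follows from Condition 5, which bounds the 2-norm of a column of $R$ by $\sqrt{k/m}$:

\[
\begin{aligned}
\sigma_k(OR) &= \min_{\|x\|_2 = 1} \|ORx\|_2 \\
&\le \|O\|_2 \cdot \min_{\|x\|_2 = 1} \|Rx\|_2 && \text{(by norm consistency)} \\
&\le \sqrt{m}\, \min_{\|x\|_2 = 1} \left( \sum_{i=1}^{m} \Big( \sum_{j=1}^{k} R_{ij}\,x_j \Big)^2 \right)^{1/2} && (\because \|A\|_2 = \sigma_1(A) \text{ for any matrix } A)
\end{aligned}
\]

Assume the $c$-th column of $R$ obeys Condition 5 for some $1 \le c \le k$. Also assume $x = e_c$, the $c$-th column of the $k \times k$ identity matrix, which obeys the constraint $\|x\|_2 = 1$. Then every component of the inner sum is zero except when $j = c$, and the min expression can only get larger:

\[
\begin{aligned}
\sigma_k(OR) &\le \sqrt{m} \left( \sum_{i=1}^{m} R_{ic}^2 \right)^{1/2} \\
&= \sqrt{m}\,\|R[\cdot, c]\|_2 \\
&\le \sqrt{m}\,\sqrt{k/m} \\
&= \sqrt{k}
\end{aligned}
\]

hence proving that $\sigma_k(OR) \le \sqrt{k}$.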


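Now we begin the proof with the $\epsilon_1$ case. Choose a $C$ that satisfies all previous bounds and also obeys $(\sqrt{C}/4) \cdot 0.05 \cdot (3/8) \ge 1$.

\[
\begin{aligned}
\epsilon_1 &\le \sqrt{1/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big) && \text{(by Lemma 18)} && \text{(A.17)} \\
&\le 2\sqrt{\ln(3/\eta)/N} && (\text{since } \sqrt{\ln(3/\eta)} \ge \sqrt{\ln 3} > 1) && \text{(A.18)}
\end{aligned}
\]

Now, plugging in the assumed value of $N$:

\[
\epsilon_1 \le \frac{2\epsilon\,\sigma_k(P_{2,1})^2\,\sigma_k(OR)}{t\sqrt{Ck\,(1 + n_0(\varepsilon)\,\sigma_k(P_{2,1})^2)}}\,\sqrt{\frac{\ln(3/\eta)}{\ln(1/\eta)}} \qquad \text{(A.19)}
\]

Any substitutions that increase the right hand side of the above inequality preserve the inequality. We now drop the additive 1 in the denominator, replace $\sqrt{\ln(3/\eta)/\ln(1/\eta)}$ by 2 since it is at most $\sqrt{\ln 3}$, and drop the factors $t$, $\sqrt{n_0(\varepsilon)}$ from the denominator.

\[
\begin{aligned}
\epsilon_1 &\le \frac{4\,\sigma_k(OR)\,\sigma_k(P_{2,1})^2\,\epsilon}{\sqrt{Ck}\;\sigma_k(P_{2,1})} && \text{(A.20)} \\
&= \frac{1}{\sqrt{C}} \cdot 4 \cdot \big[\sigma_k(OR)/\sqrt{k}\big] \cdot \big[\sigma_k(P_{2,1})\big] \cdot \epsilon \\
&\le \frac{1}{\sqrt{C}} \min\Big(4 \cdot \sigma_k(P_{2,1}) \cdot \epsilon,\ 4 \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\because \text{both } \sigma_k(OR)/\sqrt{k} \text{ and } \sigma_k(P_{2,1}) \text{ are } \le 1) \\
&\le \frac{1}{C'} \min\Big(0.05 \cdot 3/8 \cdot \sigma_k(P_{2,1}) \cdot \epsilon,\ 0.05 \cdot \sqrt{3}/2 \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\text{for } C' = (\sqrt{C}/4) \cdot 0.05 \cdot 3/8) \\
&\le \min\Big(0.05 \cdot 3/8 \cdot \sigma_k(P_{2,1}) \cdot \epsilon,\ 0.05 \cdot \sqrt{3}/2 \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\because C' \ge 1)
\end{aligned}
\]

Hence proving the required bound for $\epsilon_1$.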

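Next we prove the $\epsilon_{2,1}$ case. Choose a $C$ that satisfies all previous bounds and also obeys $(\sqrt{C}/4) \cdot 0.01 \cdot (\sqrt{3}/8) \ge 1$. Note that, since the bound on $\epsilon_{2,1}$ in Lemma 18 is the same as for $\epsilon_1$, we can start with the analogue of equation (A.19):

\[
\epsilon_{2,1} \le \frac{2\epsilon\,\sigma_k(P_{2,1})^2\,\sigma_k(OR)}{t\sqrt{Ck\,(1 + n_0(\varepsilon)\,\sigma_k(P_{2,1})^2)}}\,\sqrt{\frac{\ln(3/\eta)}{\ln(1/\eta)}}
\]

We now drop the additive $n_0(\varepsilon)\,\sigma_k(P_{2,1})^2$ in the denominator, again replace $\sqrt{\ln(3/\eta)/\ln(1/\eta)}$ by 2 since it is at most $\sqrt{\ln 3}$, and drop the multiplicative factor $t$ from the denominator.

\[
\begin{aligned}
\epsilon_{2,1} &\le \frac{1}{\sqrt{C}} \cdot 4 \cdot \sigma_k(P_{2,1})^2 \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon \\
&= \frac{1}{\sqrt{C}} \cdot 4 \cdot \big[\sigma_k(P_{2,1})^2\big] \cdot \big[\sigma_k(OR)/\sqrt{k}\big] \cdot \epsilon \\
&\le \frac{1}{\sqrt{C}} \min\Big(4 \cdot \sigma_k(P_{2,1})^2 \cdot \epsilon,\ 4 \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\because \sigma_k(OR)/\sqrt{k} \le 1) \\
&\le \frac{1}{C'} \min\Big(0.05 \cdot 1/8 \cdot \sigma_k(P_{2,1})^2 \cdot \epsilon,\ 0.01 \cdot \sqrt{3}/8 \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\text{for } C' = (\sqrt{C}/4) \cdot 0.01 \cdot (\sqrt{3}/8)) \\
&\le \min\Big(0.05 \cdot 1/8 \cdot \sigma_k(P_{2,1})^2 \cdot \epsilon,\ 0.01 \cdot \sqrt{3}/8 \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot (1/\sqrt{k}) \cdot \epsilon\Big) && (\text{since } C' \ge 1)
\end{aligned}
\]

hence proving the bound on $\epsilon_{2,1}$.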

Finally  for  Yhx  assume  C  such  that  >  1  in  addition  to  previous 

requirements  on  C.  we  first  restate  the  bound  from  Lemma  18: 

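\[
\begin{aligned}
\sum_x \epsilon_{3,x,1} &\le \min_j \Big( \sqrt{j/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big) + 2\epsilon(j) \Big) + \sqrt{1/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big) \\
&\le \sqrt{n_0(\varepsilon)/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big) + 2\epsilon(n_0(\varepsilon)) + \sqrt{1/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big) \\
&\le \sqrt{1/N}\,\big(\sqrt{\ln(3/\eta)} + 1\big)\big(\sqrt{n_0(\varepsilon)} + 1\big) + 2\varepsilon && (\text{since } \epsilon(n_0(\varepsilon)) \le \varepsilon)
\end{aligned}
\]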
The first two terms are exactly as before, so we perform the same steps as in equations (A.17)-(A.20), except that we do not drop  to get:


Ckno{e)(Tk{P2,i 

<  .  (2  .  no{e))  +  2ak{OR)ak{P2,i)e/ 4:t\fk 

t-^Cknoye) 

(sinee  1  +  no(e:)  <  2  •  no{e),  and  plugging  in  e) 


<  ak{OR)  ■  ak{P2,i)  ■  tVk  ■  e  ■  ^8/ VC  +  1/2^ 

<  ^0.39  ■  (3^3/8)  ■  ak{OR)  ■  ak{P2,i)  ■  ■  e  (for  C  = 

<  0.39  ■  (3^3/8)  ■  (Jk{OR)  ■  <^k{P2,i)  ■  tVk  ■  e  (sinee  C"  >  1  by  assumption) 


Hence proving the required bound for $\sum_x \epsilon_{3,x,1}$.

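Proof of second step: Substituting from equation (A.14) into $\delta_1$ in Lemma 22:

\[
\begin{aligned}
\delta_1 &\le \frac{2}{\sqrt{3}} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \epsilon_1 \\
&\le \frac{2\sqrt{k}}{\sqrt{3}\,\sigma_k(OR)} \min\Big(.05 \cdot \frac{3}{8}\,\sigma_k(P_{2,1})\,\epsilon,\ .05 \cdot \frac{\sqrt{3}}{2}\,\sigma_k(OR)\,\frac{1}{\sqrt{k}}\,\epsilon\Big) \\
&= .05\epsilon \cdot \min\Big(\frac{\sqrt{3}}{4} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \sigma_k(P_{2,1}),\ 1\Big) \\
&\le .05\epsilon
\end{aligned}
\]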
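Substituting from equation (A.14) into $\delta_\infty$ in Lemma 22:

\[
\begin{aligned}
\delta_\infty &\le 4\left(\frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\epsilon_1}{3\,\sigma_k(P_{2,1})}\right) \\
&\le \frac{4}{\sigma_k(P_{2,1})^2} \min\Big(.05 \cdot (1/8) \cdot \sigma_k(P_{2,1})^2 \cdot (\epsilon/5),\ .01 \cdot (\sqrt{3}/8) \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot (1/(t\sqrt{k})) \cdot \epsilon\Big) \\
&\quad+ \frac{4}{3\,\sigma_k(P_{2,1})} \min\Big(.05 \cdot (3/8) \cdot \sigma_k(P_{2,1}) \cdot \epsilon,\ .05 \cdot (\sqrt{3}/2) \cdot \sigma_k(OR) \cdot (1/\sqrt{k}) \cdot \epsilon\Big) \\
&\le \min\Big(.005\epsilon,\ .04 \cdot (\sqrt{3}/8) \cdot \sigma_k(OR) \cdot (1/(t\sqrt{k})) \cdot \epsilon\Big) \\
&\quad+ \min\Big(.05 \cdot (1/2) \cdot \epsilon,\ .05 \cdot (2/\sqrt{3}) \cdot \frac{\sigma_k(OR)}{\sigma_k(P_{2,1})} \cdot (1/\sqrt{k}) \cdot \epsilon\Big) \\
&\le .05\epsilon\,(.1 + .5) \\
&\le .05\epsilon
\end{aligned}
\]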


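Substituting from equation (A.14) into $\Delta$ in Lemma 22:

\[
\begin{aligned}
\Delta &\le \frac{8}{\sqrt{3}} \cdot \frac{\sqrt{k}}{\sigma_k(OR)} \cdot \left(\frac{\epsilon_{2,1}}{\sigma_k(P_{2,1})^2} + \frac{\sum_x \epsilon_{3,x,1}}{3\,\sigma_k(P_{2,1})}\right) \\
&\le \frac{8\sqrt{k}}{\sqrt{3}\,\sigma_k(OR)} \bigg( \frac{1}{\sigma_k(P_{2,1})^2} \min\Big(.05 \cdot (1/8) \cdot \sigma_k(P_{2,1})^2 \cdot (\epsilon/5),\ .01 \cdot (\sqrt{3}/8) \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1})^2 \cdot \frac{1}{t\sqrt{k}} \cdot \epsilon\Big) \\
&\qquad\qquad+ \frac{1}{3\,\sigma_k(P_{2,1})} \cdot 0.39 \cdot (3\sqrt{3}/8) \cdot \sigma_k(OR) \cdot \sigma_k(P_{2,1}) \cdot (1/(t\sqrt{k})) \cdot \epsilon \bigg) \\
&= \min\Big(.05 \cdot (\epsilon/5) \cdot \frac{\sqrt{k}}{\sqrt{3}\,\sigma_k(OR)},\ .01 \cdot \frac{\epsilon}{t}\Big) + 0.39 \cdot \frac{\epsilon}{t} \\
&\le .01 \cdot \frac{\epsilon}{t} + 0.39 \cdot \frac{\epsilon}{t} \\
&= 0.4\,\epsilon/t
\end{aligned}
\]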
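The second step is pure arithmetic, so it can be spot-checked numerically. The sketch below plugs the largest sampling errors allowed by (A.14) into the Lemma 22 expressions used above and compares the results against (A.15). The function names and the test values for $\epsilon$, $t$, $k$, $\sigma_k(OR)$ and $\sigma_k(P_{2,1})$ are assumptions made for illustration only.

```python
import math


def sampling_error_caps(eps, t, k, s_or, s_p):
    """Largest values of eps_1, eps_{2,1}, sum_x eps_{3,x,1} allowed by (A.14).
    s_or stands for sigma_k(OR), s_p for sigma_k(P_{2,1})."""
    root_k = math.sqrt(k)
    e1 = min(0.05 * (3 / 8) * s_p * eps,
             0.05 * (math.sqrt(3) / 2) * s_or * eps / root_k)
    e21 = min(0.05 * (1 / 8) * s_p ** 2 * (eps / 5),
              0.01 * (math.sqrt(3) / 8) * s_or * s_p ** 2 * eps / (t * root_k))
    e3 = 0.39 * (3 * math.sqrt(3) / 8) * s_or * s_p * eps / (t * root_k)
    return e1, e21, e3


def lemma22_bounds(e1, e21, e3, k, s_or, s_p):
    """Upper bounds on delta_inf, delta_1 and Delta in the form used above."""
    root_k = math.sqrt(k)
    d_inf = 4 * (e21 / s_p ** 2 + e1 / (3 * s_p))
    d_one = (2 / math.sqrt(3)) * (root_k / s_or) * e1
    big_delta = (8 / math.sqrt(3)) * (root_k / s_or) * (e21 / s_p ** 2 + e3 / (3 * s_p))
    return d_inf, d_one, big_delta


eps, t, k, s_or, s_p = 0.5, 10, 3, 0.8, 0.3      # illustrative values only
d_inf, d_one, big_delta = lemma22_bounds(*sampling_error_caps(eps, t, k, s_or, s_p),
                                         k, s_or, s_p)
tol = 1e-12                                       # guard against rounding
print(d_inf <= 0.05 * eps + tol, d_one <= 0.05 * eps + tol,
      big_delta <= 0.4 * eps / t + tol)
```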




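Proof of third step: By Lemma 24,

\[
\begin{aligned}
\sum_{x_{1:t}} \big|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big| &\le (1+\delta_\infty)(1+\delta_1)(1+\Delta)^t - 1 \\
&\le (1 + .05\epsilon)(1 + .05\epsilon)(1 + 0.4\epsilon/t)^t - 1 && \text{(by equations (A.15))} \\
&\le (1 + .05\epsilon)(1 + .05\epsilon)(1 + 0.8\epsilon) - 1 && \text{(by equation (A.16), since } 0.4\epsilon < 1/2) \\
&= 1 + .05\epsilon + .05\epsilon + .05^2\epsilon^2 + 0.8\epsilon + .04\epsilon^2 + .04\epsilon^2 + (.05)^2 \cdot 0.8\,\epsilon^3 - 1 \\
&= .002\epsilon^3 + .0825\epsilon^2 + 0.9\epsilon \\
&\le (.002 + .0825 + 0.9)\,\epsilon && \text{(since } \epsilon < 1 \text{ by assumption)} \\
&= 0.9845\epsilon \\
&< \epsilon
\end{aligned}
\]

This completes the proof of Theorem 2. □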
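As a quick check of the arithmetic in the third step, the sketch below (hypothetical helper names) compares the product $(1+.05\epsilon)(1+.05\epsilon)(1+0.8\epsilon) - 1$ with its expanded polynomial form and confirms that it stays below $\epsilon$ for several values of $\epsilon < 1$.

```python
import math


def product_form(eps):
    """(1 + .05 eps)(1 + .05 eps)(1 + 0.8 eps) - 1."""
    return (1 + 0.05 * eps) * (1 + 0.05 * eps) * (1 + 0.8 * eps) - 1


def expanded_form(eps):
    """Coefficients obtained by multiplying out the product above."""
    return 0.002 * eps ** 3 + 0.0825 * eps ** 2 + 0.9 * eps


for eps in (0.01, 0.1, 0.5, 0.99):
    assert math.isclose(product_form(eps), expanded_form(eps), rel_tol=1e-9)
    assert product_form(eps) < eps
```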


A.1.5 Proof of Theorem 2 for Continuous Observations

For continuous observations, we use Kernel Density Estimation (KDE) [109] to model the observation probability density function (PDF). We use a fraction of the training data points as kernel centers, placing one multivariate Gaussian kernel at each point.¹ The KDE estimator of the observation PDF is a convex combination of these kernels; since each kernel integrates to 1, this estimator also integrates to 1. KDE theory [109] tells us that as the number of kernel centers and the number of samples go to infinity and the kernel bandwidth goes to zero (at appropriate rates), the KDE estimator converges to the observation PDF in $L_1$ norm. The kernel density estimator is completely determined by the normalized vector of kernel weights; therefore, if we can estimate this vector accurately, our estimate will converge to the observation PDF as well.

Hence our goal is to predict the correct expected value of this normalized kernel vector given all past observations (or more precisely, given the appropriate sequence of past observations, or the appropriate indicative events/features). In the context of Theorem 2, joint probability estimates for $t$-length observation sequences are effectively the expectation of entries in a $t$-dimensional tensor formed by the outer product of $t$ indicator vectors. When we move to KDE, we instead estimate the expected outer product of $t$ stochastic vectors, namely, the normalized kernel weights at each time step. As long as the sum of errors in estimating entries of this table goes to zero for any fixed $t$ as the number of samples increases, our estimated observation PDFs will have bounded error.

¹ We use a general elliptical covariance matrix, chosen by SVD; that is, we use a spherical covariance after projecting onto the singular vectors and scaling by the square roots of the singular values.

The only differences in the proof are as follows. In Lemma 18, we observe $q_i$ to be stochastic vectors instead of indicator vectors; their expectation is still the true value of the quantity we are trying to predict. The $p_i$ are also stochastic vectors in that proof. In the proof of Proposition 20, $p_k$ is an arbitrary stochastic vector. Also, $q_i^\top q_i \le \|q_i\|_1^2 = 1$ now instead of being always equal to 1, and the same holds for $p_j^\top p_j$. Also $\|p_i - \hat p_i\|_2 \le \|p_i - \hat p_i\|_1$ (by triangle inequality). Besides these things, the above proof goes through as it is.

Note  that  in  the  continuous  observation  case,  there  are  continuously  many  observable 
operators  that  can  be  computed.  We  compute  one  base  operator  for  each  kernel  center, 
and  use  convex  combinations  of  these  base  operators  to  compute  observable  operators  as 
needed. 
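A minimal sketch of this construction is given below. It makes two simplifying assumptions that are not taken from the text: the kernels are spherical Gaussians with a shared bandwidth (rather than the elliptical covariance described in the footnote), and the convex-combination coefficients are the normalized kernel weights of the observation. The function names and the toy base operators are hypothetical.

```python
import math


def kernel_weights(x, centers, bandwidth):
    """Normalized Gaussian kernel weights of observation x with respect to
    each kernel center; the result is a stochastic vector."""
    raw = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        raw.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(raw)
    return [r / total for r in raw]


def combine_operators(weights, base_ops):
    """Convex combination of base operators (one k x k matrix per center)."""
    rows, cols = len(base_ops[0]), len(base_ops[0][0])
    return [[sum(w * op[r][c] for w, op in zip(weights, base_ops))
             for c in range(cols)]
            for r in range(rows)]


# Two kernel centers in 2-D and one toy base operator per center.
centers = [(0.0, 0.0), (1.0, 0.0)]
base_ops = [[[1.0, 0.0], [0.0, 1.0]],
            [[3.0, 0.0], [0.0, 1.0]]]
w = kernel_weights((0.5, 0.0), centers, bandwidth=1.0)
op = combine_operators(w, base_ops)
print(w, op)
```

An observation far from every center would make all raw kernel values underflow to zero; a practical implementation would need to guard against that, which this sketch does not.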


A.2  Learning  with  Ambiguous  Observations:  Example 


When stacking observations, the modified, larger $\bar P_{2,1} \in \mathbb{R}^{\bar n \times \bar n}$ still has rank at most $k$ since it can be written in the form $\bar P_{2,1} = GTH$ for some matrices $G, H^\top \in \mathbb{R}^{\bar n \times m}$. For example, if $n = 2$ for an HMM with ambiguous observations, and we believe stacking 2 observations per timestep will yield a sufficiently informative observation, the new observation space will consist of all $\bar n = n^2 = 4$ possible tuples of single observations and $\bar P_{2,1} \in \mathbb{R}^{n^2 \times n^2}$, with each observation $i$ corresponding to a tuple $\langle i_1, i_2 \rangle$ of the original observations. Specifically,

\[
\begin{aligned}
\bar P_{2,1}(j, i) &= \Pr(x_4 = j_2,\ x_3 = j_1,\ x_2 = i_2,\ x_1 = i_1) \\
&= \sum_{a,b,c,d} \Pr(x_4 = j_2,\ x_3 = j_1,\ x_2 = i_2,\ x_1 = i_1,\ h_4 = d,\ h_3 = c,\ h_2 = b,\ h_1 = a) \\
&= \sum_{a,b,c,d} O_{j_2 d}\,T_{dc}\,O_{j_1 c}\,T_{cb}\,O_{i_2 b}\,T_{ba}\,O_{i_1 a}\,\pi_a \\
&= \sum_{b,c} \bar O_{j,c}\,T_{cb}\,\big[\mathrm{diag}(\pi)\,\bar O^\top\big]_{b,i} \qquad \text{where } \bar O_{j,c} = \sum_d O_{j_2 d}\,T_{dc}\,O_{j_1 c} \\
\Rightarrow \bar P_{2,1} &= \bar O\,T\,\mathrm{diag}(\pi)\,\bar O^\top
\end{aligned}
\]

Similarly, we can show that $\bar P_{3,x,1} = GTH$ for some matrices $G$, $H$. The exact formulae will differ for different choices of past and future observable statistics.
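To make the stacked observation space concrete, the following sketch enumerates the tuples for $n = 2$ with 2 observations per timestep and maps each tuple to a single index. The function names and the particular index ordering are assumptions for illustration; any fixed bijection between tuples and indices would serve.

```python
from itertools import product


def stacked_observations(n, stack):
    """All tuples of `stack` single observations, each drawn from 0..n-1."""
    return list(product(range(n), repeat=stack))


def stacked_index(obs_tuple, n):
    """Map a tuple of single observations to one index in 0..n**stack - 1
    by reading the tuple as a base-n number."""
    idx = 0
    for o in obs_tuple:
        idx = idx * n + o
    return idx


tuples = stacked_observations(2, 2)
print(tuples)                                  # the 4 possible tuples
print([stacked_index(t, 2) for t in tuples])   # their stacked indices
```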

A.3  Synthetic  Example  RR-HMM  Parameters 

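Example 1

\[
T = \begin{bmatrix} 0.3894 & 0.2371 & 0.3735 \\ 0.2371 & 0.4985 & 0.2644 \\ 0.3735 & 0.2644 & 0.3621 \end{bmatrix}
\]

Example 2

\[
T = \begin{bmatrix} 0.6736 & 0.0051 & 0.1639 \\ 0.0330 & 0.8203 & 0.2577 \\ 0.2935 & 0.1746 & 0.5784 \end{bmatrix}
\]

O =

Example 3

\[
T = \begin{bmatrix} 0.7829 & 0.1036 & 0.0399 & 0.0736 \\ 0.1036 & 0.4237 & 0.4262 & 0.0465 \\ 0.0399 & 0.4262 & 0.4380 & 0.0959 \\ 0.0736 & 0.0465 & 0.0959 & 0.7840 \end{bmatrix}
\]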
A.4 Consistency Result for Learning with Indicative and Characteristic Features

Here we prove consistency of the RR-HMM learning algorithm when handling arbitrary continuous features of groups of observations, termed indicative features for the past and characteristic features for the future. As before, the low-rank transition matrix is denoted by $T = RS$. The observation probability model is $O$, which we think of as a "matrix" whose first dimension is continuous (corresponding to observations $x_t$) and whose second dimension is the number of states, so that "columns" are probability distributions. The statistics are , $W$, and  (for past, present, and future), each of which we think of as a "matrix" whose first dimension is (number of statistics) and whose second is continuous (corresponding to observations), so that "rows" are statistics that we collect. $W_j$ means the $j$-th row of $W$. The initial distribution over states, as before, is $\pi$ (a vector of length number of states). In this scenario, the quantities $P_1$, $P_{2,1}$ and $P_{3,x,1}$ no longer contain probabilities but rather expected values of singletons, pairs and triples of features. For the special case of features that are indicator variables of events, we recover the conventional case where $P_1$, $P_{2,1}$, $P_{3,x,1}$ consist of probabilities, and our performance bounds hold in addition to the consistency results which we show below for the general features case. In the derivation below, we consider $P_{3,x,1}$ to be a three-dimensional tensor rather than a series of matrices, for simplicity.

With this notation, we have the following expressions for $P_{2,1}$ and $P_{3,x,1}$ (see Derivations subsection below):

P2,i  =  {W^OR){S  diag(7r)0'IT^^)  (A.22) 

J, :)  =  {W^OR){S  dmg{Wp)R){S  dmg{7:)0'W^^)  (A.23) 

(A.24) 

Factoring $P_{2,1} = UV$, we know that $U$ has the same range (column span) as $P_{2,1}$, so

U  =  {W^OR)A 

for  some  (square,  invertible)  matrix  A  in  the  low-rank  space.  Assuming  W^OR  has  full 
column rank, and writing $U^+$ for the pseudo-inverse of $U$, we get


t/+P2^^  =  A-^S  diag(7r)0'W"^’^  (A.25) 

J, :)  =  dia.g{W,0)R){S  dia.g{n)0'W^^)  (A.26) 


Assuming  S  diag(7r)OW^^ 


has  full  row  rank,  we  then  get 


\[
B_j = U^+ P_{3,x,1}(:, j, :)\,(U^+ P_{2,1})^+ = A^{-1}\big(S\,\mathrm{diag}(W_j O)\,R\big)A
\]


This is the analog of the original result about our learned observable operators: $B_j$ is a similarity transform away from $S\,\mathrm{diag}(W_j O)\,R$. However, these new "observable operators" aren't necessarily what we need for tracking, since $W_j O$ isn't necessarily a probability distribution. It's possible that we can either come up with conditions under which the new "observable operators" are really what we need for tracking, or some slight modification of the statistics collected that makes them so.
Derivations


Derivations  for  P2,iiPi,,x,i  that  also  show  them  to  be  low -rank: 


P2,i{i,  k)  =  E(tf  {xt)tl{xt+i)) 

=  j  j  {xt)t^{xt+i)dxtdxt+i 

=  11 

J  J  u  u 


H+1) 


ht  ht+\ 

0{xt,  ht)0{xt+i,  ht+i)t^  {xt)t^{xt+i)dxtdxt+i 

/ /EEE  nht)R{Kzt)S{zt,  kt+i) 

ht  ht+i  zt 

0{xt,  ht)0{xt+i,  ht+i)t[ {xt)t^{xt+i)dxtdxt+i 
f’{ht)R{ht,  zt)0{xt,  ht)ti  {xt)dx^ 


Zt 


ht 


'^S{zt,  ht+i)0{xt+i,  ht+i)t^{xt+i)dxt+i 

ht+i 
For $P_{3,x,1}$:

P3,xAi,J,k)  =  E{t[ {xt+i)) 

=  jj  j  P{xt-i,xt,xt+i)ti{xt-i)tj{xt)tl{xt+i)dxt-idxtdxt+i 

=  ///EEE  nht-i)T{ht.^,ht)T{ht  )  ^t+i) 

ht-i  ht  ht+i 

0{xt-i,  ht-i)0{xt,  ht)0{xt+i,  ht+i)t[ {xt-i)tj{xt)t^{xt+i)dxt-idxtdxt+i 

=  ///EEEEE  F{ht-i)R{ht-i,  Zt-i)S{zt-i,  ht)R{ht,  Zt)S{zt,  ht+i) 

J  J  ht-l  ht  ht+lZt-l  Zt 

0(xt-i,  ht-i)0{xt,  ht)0{xt+i,  ht+i)t[ {xt-i)tj{xt)t^{xt+i)dxt-idxtdxt+i 

=  EE  /E  F{ht-i)R{ht_i,  zt-i)0{xt-i,  ht-i)t[ {xt-i)dxt-i 

Zt-I  Zt  y  ht-i  ^ 

^S{zt-1,  ht)R{ht,  Zt)0{xt,  ht)tj{xt)dxt 


'^Sizt,  ht+i)0{xt+i,  ht+i)t^ {xt+i)dxt+i 


'^t  +  l 


A.5  Consistency  Result  for  Learning  PSRs 

In this case, $R$ and $S$ don't exist in the true model, and there is no matrix $O$. There are only observable operators [69], which we will call $M_i$, and a normalization vector $b_\infty$, and from these we can derive update vectors $b_i^\top = b_\infty^\top M_i$. Assume for now that the observation features are single discrete observations, and that there is only a single action. The tests (characteristic events) are assumed to be indicator functions of the next single observation, so the expected value of the test is the probability of seeing that observation. The histories (indicative events) similarly correspond to the single previous observation, i.e. the PSR/OOM is assumed to be 1-step observable (see [73] for a generalization of this proof to general indicative and characteristic events, multiple control inputs and continuous observations). Assume $i$, $j$, $k$ are the indices of observations $x_t$, $x_{t+1}$, $x_{t+2}$ respectively. Let $z$ denote the hidden state at time $t$, $P(z)$ denote the belief over $z$, and $\bar z$ denote $\int z\,P(z)\,dz$. For $P_3$:

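\[
\begin{aligned}
P_3(i, j, k) &= \int P(z)\,P(x_t = i \mid h_t = z)\,P(x_{t+1} = j \mid h_t = z)\,P(x_{t+2} = k \mid x_{t+1} = j,\ h_t = z)\,dz \\
&= \int P(z)\,\big(b_i^\top z\big)\,\frac{b_j^\top M_i z}{b_i^\top z}\,\frac{b_k^\top M_j M_i z}{b_j^\top M_i z}\,dz \\
&= \int P(z)\,b_k^\top M_j M_i z\,dz \\
&= b_k^\top M_j M_i \bar z
\end{aligned}
\]

Similarly, for $P_2$:

\[
\begin{aligned}
P_2(i, j) &= \int P(z)\,P(x_t = i \mid h_t = z)\,P(x_{t+1} = j \mid h_t = z)\,dz \\
&= \int P(z)\,b_j^\top M_i z\,dz \\
&= b_j^\top M_i \bar z
\end{aligned}
\]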


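Let $B$ and $M$ be matrices such that:

\[
B = \begin{bmatrix} \vdots \\ \text{---}\ b_j^\top\ \text{---} \\ \vdots \end{bmatrix}_{n \times k}
\quad \text{and} \quad
M = \begin{bmatrix} & | & \\ \cdots & M_i \bar z & \cdots \\ & | & \end{bmatrix}
\]

Then, we can write

\[
\begin{aligned}
P_2 &= BM \\
P_3(:, j, :) &= B M_j M
\end{aligned}
\]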

Factoring $P_2 = UV$, we know that $U$ has the same range (column span) as $P_2$, so

\[
U = BA
\]

for some (square, invertible) matrix $A$ in the low-rank space. Assuming $B$ has full column rank, and writing $U^+$ for the pseudo-inverse of $U$, we get

\[
\begin{aligned}
U^+ P_2 &= A^{-1} M && \text{(A.27)} \\
U^+ P_3(:, j, :) &= A^{-1} M_j M && \text{(A.28)}
\end{aligned}
\]

Assuming $M$ has full row rank, we then get

\[
B_j = U^+ P_3(:, j, :)\,(U^+ P_2)^+ = A^{-1} M_j A
\]

$B_j$ is a similarity transform away from $M_j$, the OOM observable operator.
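A similarity transform of the operators does not change the sequence probabilities they define, provided the initial and normalization vectors are transformed consistently. The sketch below illustrates this on a toy 2-dimensional example. The choice $b_1' = A^{-1} b_1$ and $b_\infty'^\top = b_\infty^\top A$ is an assumption of this sketch (the text above only derives the transformed operators), and all numeric values and function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def inv2(a):
    """Inverse of a 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def sequence_score(b_inf, ops, b_1, seq):
    """b_inf^T M_{x_t} ... M_{x_1} b_1 for the observation sequence seq."""
    state = [[v] for v in b_1]                 # column vector
    for x in seq:
        state = matmul(ops[x], state)
    return matmul([b_inf], state)[0][0]


def transform(b_inf, ops, b_1, a):
    """Apply the similarity transform B_x = A^{-1} M_x A to every operator,
    with the (assumed) matching transforms of b_1 and b_inf."""
    a_inv = inv2(a)
    new_ops = {x: matmul(matmul(a_inv, m), a) for x, m in ops.items()}
    new_b1 = [row[0] for row in matmul(a_inv, [[v] for v in b_1])]
    new_binf = matmul([b_inf], a)[0]
    return new_binf, new_ops, new_b1


ops = {0: [[0.5, 0.1], [0.2, 0.3]], 1: [[0.2, 0.3], [0.1, 0.3]]}
b_inf, b_1 = [1.0, 1.0], [0.6, 0.4]
a = [[2.0, 1.0], [1.0, 1.0]]
t_inf, t_ops, t_1 = transform(b_inf, ops, b_1, a)
seq = [0, 1, 1, 0]
print(sequence_score(b_inf, ops, b_1, seq), sequence_score(t_inf, t_ops, t_1, seq))
```

The two printed scores agree up to rounding, because every $A A^{-1}$ pair between consecutive factors cancels.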